Abstract: Much is now known about the consistency of Bayesian updating
on infinite-dimensional parameter spaces with independent or Markovian
data. Necessary conditions for consistency include the prior putting
enough weight on the correct neighborhoods of the data-generating
distribution; various sufficient conditions further restrict the prior in
ways analogous to capacity control in frequentist nonparametrics. The
asymptotics of Bayesian updating with mis-specified models or priors, or
non-Markovian data, are far less well explored. Here I establish sufficient
conditions for posterior convergence when all hypotheses are wrong, and the
data have complex dependencies. The main dynamical assumption is the
asymptotic equipartition (Shannon-McMillan-Breiman) property of information
theory. This, along with Egorov's Theorem on uniform convergence, lets me
build a sieve-like structure for the prior. The main statistical assumption,
also a form of capacity control, concerns the compatibility of the prior
and the data-generating process, controlling the fluctuations in the
log-likelihood when averaged over the sieve-like sets. In addition to
posterior convergence, I derive a kind of large deviations principle for the
posterior measure, extending in some cases to rates of convergence, and
discuss the advantages of predicting using a combination of models known to
be wrong. An appendix sketches connections between these results and the
replicator dynamics of evolutionary theory.
1. Introduction

The problem of the convergence and frequentist consistency of Bayesian learning
goes as follows. We encounter observations $X_1, X_2, \ldots$, which we would
like to predict by means of a family $\Theta$ of models or hypotheses (indexed
by $\theta$). We begin with a prior probability distribution $\Pi_0$ over
$\Theta$, and update this using Bayes's rule, so that our distribution after
seeing $X_1, X_2, \ldots, X_t = X_1^t$ is $\Pi_t$. If the observations come
from a stochastic process with infinite-dimensional distribution $P$, when does
$\Pi_t$ converge $P$-almost surely? What is the rate of convergence? Under what
conditions will Bayesian learning be consistent, so that $\Pi_t$ doesn't just
converge but its limit is $P$?

Since the Bayesian estimate is really the whole posterior probability
distribution $\Pi_t$ rather than a point or set in $\Theta$, consistency
becomes concentration of $\Pi_t$ around $P$. One defines some sufficiently
strong set of neighborhoods of $P$ in the space of probability distributions on
$X_1^\infty$, and says that $\Pi_t$ is consistent when, for each such
neighborhood $N$, $\lim_{t \to \infty} \Pi_t(N) = 1$. When this holds, the
posterior increasingly approximates a degenerate distribution centered at the
truth.

The greatest importance of these problems, perhaps, is their bearing on the 
objectivity and reliability of Bayesian inference; consistency proofs and conver- 
gence rates are, as it were, frequentist licenses for Bayesian practices. Moreover, 
if Bayesian learners starting from different priors converge rapidly on the same 
posterior distribution, there is less reason to worry about the subjective or ar- 
bitrary element in the choice of priors. (Such "merger of opinion" results [7] 
are also important in economics and game theory [11].) Recent years have seen
considerable work on these problems, especially in the non-parametric setting
where the model space $\Theta$ is infinite-dimensional [30].

Pioneering work by Doob [19], using elegant martingale arguments, established
that when any consistent estimator exists, and $P$ lies in the support of
$\Pi_0$, the set of sample paths on which the Bayesian learner fails to
converge to the truth has prior probability zero. (See [T3] and [15] for
extensions of this result to non-IID settings, and also the discussion in
[55][5T].) This is not, however, totally reassuring, since $P$ generally also
has prior probability zero, and it would be unfortunate if these two
measure-zero sets should happen to coincide. Indeed, Diaconis and Freedman
established that the consistency of Bayesian inference depends crucially on the
choice of prior, and that even very natural priors can lead to inconsistency
(see [IB] and references therein).

Subsequent work, following a path established by Schwartz [55], has shown
that, no matter what the true data-generating distribution $P$, Bayesian
updating converges along $P$-almost-all sample paths, provided that (a) $P$ is
contained in $\Theta$, (b) every Kullback-Leibler neighborhood in $\Theta$ has
some positive prior probability (the "Kullback-Leibler property"), and (c)
certain restrictions hold on the prior, amounting to versions of capacity
control, as in the method of sieves or structural risk minimization. These
contributions also make (d) certain dynamical assumptions about the
data-generating process, most often that it is IID [U [26] [65] (in this
setting, [27] and [59] in particular consider convergence rates), independent
non-identically distributed [T3] HH], or, in some cases, Markovian [55] [53];
[T3] and [55] work with spectral density estimation and sequential analysis,
respectively, again exploiting specific dynamical properties.

For mis-specified models, that is, settings where (a) above fails, important
early results were obtained by Berk [5] [6] for IID data, albeit under rather
strong restrictions on likelihood functions and parameter spaces, showing that
the posterior distribution concentrates on an "asymptotic carrier", consisting
of the hypotheses which are the best available approximations, in the
Kullback-Leibler sense, to $P$ within the support of the prior. More recently,
[35], [BHJ and [41] have dealt with the convergence of non-parametric Bayesian
estimation for IID data when $P$ is not in the support of the prior, obtaining
results similar to Berk's in far more general settings, extending in some
situations to rates of convergence. All of this work, however, relies on the
dynamical assumption of an IID data-source.

This paper gives sufficient conditions for the convergence of the posterior 
without assuming (a), and substantially weakening (c) and (d). Even if one uses 
non-parametric models, cases where one knows that the true data generating 
process is exactly represented by one of the hypotheses in the model class are 
scarce. Moreover, while IID data can be produced, with some trouble and ex- 
pense, in the laboratory or in a well-conducted survey, in many applications 
the data are not just heterogeneous and dependent, but their heterogeneity and 
dependence is precisely what is of interest. This raises the question of what 
Bayesian updating does when the truth is not contained in the support of the 
prior, and observations have complicated dependencies. 

To answer this question, I first weaken the dynamical assumptions to the
asymptotic equipartition property (Shannon-McMillan-Breiman theorem) of
information theory, i.e., for each hypothesis $\theta$, the log-likelihood per
unit time converges almost surely. This log-likelihood per unit time is
basically the growth rate of the Kullback-Leibler divergence between $P$ and
$\theta$, $h(\theta)$. As observations accumulate, areas of $\Theta$ where
$h(\theta)$ exceeds its essential infimum $h(\Theta)$ tend to lose posterior
probability, which concentrates in divergence-minimizing regions. Some
additional conditions on the prior distribution are needed to prevent it from
putting too much weight initially on hypotheses with high divergence rates but
slow convergence of the log-likelihood. As the latter assumptions are
strengthened, more and more can be said about the convergence of the posterior.

Using the weakest set of conditions (Assumptions 1-3), the long-run
exponential growth rate of the posterior density at $\theta$ cannot exceed
$h(\Theta) - h(\theta)$ (Theorem 1). Adding Assumptions 4-6 to provide better
control over the integrated or marginal likelihood establishes (Theorem 2) that
the long-run growth rate of the posterior density is in fact
$h(\Theta) - h(\theta)$. One more assumption (7) then lets us conclude
(Theorem 3) that the posterior distribution converges, in the sense that, for
any set of hypotheses $A$, the posterior probability $\Pi_t(A) \to 0$ unless
the essential infimum of $h(\theta)$ over $A$ equals $h(\Theta)$. In fact, we
then have a kind of large deviations principle for the posterior measure
(Theorem 4), as well as a bound on the generalization ability of the posterior
predictive distribution (Theorem 5). Convergence rates for the posterior
(Theorem 6) follow from the combination of the large deviations result with an
extra condition related to Assumption 6. Importantly, Assumptions 4-7, and so
the results following from them, involve both the prior distribution and the
data-generating process, and require the former to be adapted to the latter.
Under mis-specification, it does not seem to be possible to guarantee posterior
convergence by conditions on the prior alone, at least not with the techniques
used here.

For the convenience of the reader, the development uses the usual statistical
vocabulary and machinery. In addition to the asymptotic equipartition property,
the main technical tools are on the one hand Egorov's theorem from basic
measure theory, which is used to construct a sieve-like sequence of sets on
which log-likelihood ratios converge uniformly, and on the other hand
Assumption 6, bounding how long averages over these sets can remain far from
their long-run limits. The latter assumption is crucial, novel, and, in its
present form, awkward to check; I take up its relation to more familiar
assumptions in the discussion. It may be of interest, however, that the results
were first found via an apparently-novel analogy between Bayesian updating and
the "replicator equation" of evolutionary dynamics, which is a formalization of
the Darwinian idea of natural selection. Individual hypotheses play the role of
distinct replicators in a population, the posterior distribution being the
population distribution over replicators and fitness being proportional to
likelihood. Appendix A gives details.



2. Preliminaries and Notation 

Let $(\Omega, \mathcal{F}, P)$ be a probability space, and $X_1, X_2, \ldots$,
for short $X_1^\infty$, be a sequence of random variables, taking values in the
measurable space $(\Xi, \mathcal{X})$, whose infinite-dimensional distribution
is $P$. The natural filtration of this process is $\sigma(X_1^t)$. The only
dynamical properties are those required for the Shannon-McMillan-Breiman
theorem (Assumption 3); more specific assumptions such as $P$ being a product
measure, Markovian, exchangeable, etc., are not required. Unless otherwise
noted, all probabilities are taken with respect to $P$, and $E[\cdot]$ always
means expectation under that distribution.

Statistical hypotheses, i.e., distributions of processes adapted to
$\sigma(X_1^t)$, are denoted by $F_\theta$, the index $\theta$ taking values in
the hypothesis space, a measurable space $(\Theta, \mathcal{T})$, generally
infinite-dimensional. For convenience, assume that $P$ and all the $F_\theta$
are dominated by a common reference measure, with respective densities $p$ and
$f_\theta$. I do not assume that $P \in \Theta$, still less that
$P \in \mathrm{supp}\, \Pi_0$; i.e., quite possibly all of the available
hypotheses are false.

We will study the evolution of a sequence of probability measures $\Pi_t$ on
$(\Theta, \mathcal{T})$, starting with a non-random prior measure $\Pi_0$. (A
filtration on $\Theta$ is not needed; the measures $\Pi_t$ change but not the
$\sigma$-field $\mathcal{T}$.) Assume all $\Pi_t$ are absolutely continuous
with respect to a common reference measure, with densities $\pi_t$.
Expectations with respect to these measures will be written either as explicit
integrals or de Finetti style, $\Pi_t(f) = \int f(\theta)\, d\Pi_t(\theta)$;
when $A$ is a set, $\Pi_t(fA) = \Pi_t(f 1_A) = \int_A f(\theta)\, d\Pi_t(\theta)$.

Let $L_t(\theta)$ be the conditional likelihood of $x_t$ under $\theta$, i.e.,
$L_t(\theta) = f_\theta(X_t = x_t \mid X_1^{t-1} = x_1^{t-1})$, with $L_0 = 1$.
The integrated conditional likelihood is $\Pi_t(L_t)$. Bayesian updating of
course means that, for any $A \in \mathcal{T}$,

$$\Pi_{t+1}(A) = \frac{\Pi_t(L_{t+1} A)}{\Pi_t(L_{t+1})}$$

or, in terms of the density,

$$\pi_{t+1}(\theta) = \frac{L_{t+1}(\theta)\, \pi_t(\theta)}{\Pi_t(L_{t+1})}$$

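It will also be convenient to express Bayesian updating in terms of the prior
and the total likelihood:

$$\Pi_t(A) = \frac{\int_A d\Pi_0(\theta)\, f_\theta(x_1^t)}{\int_\Theta d\Pi_0(\theta)\, f_\theta(x_1^t)} = \frac{\int_A d\Pi_0(\theta)\, \frac{f_\theta(x_1^t)}{p(x_1^t)}}{\int_\Theta d\Pi_0(\theta)\, \frac{f_\theta(x_1^t)}{p(x_1^t)}} = \frac{\Pi_0(R_t A)}{\Pi_0(R_t)}$$

where $R_t(\theta) = f_\theta(x_1^t) / p(x_1^t)$ is the ratio of model
likelihood to true likelihood. (Note that $0 < p(x_1^t) < \infty$ for all $t$
$P$-a.s.) Similarly,

$$\pi_t(\theta) = \pi_0(\theta) \frac{R_t(\theta)}{\Pi_0(R_t)}$$
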

The one-step-ahead predictive distribution of the hypothesis $\theta$ is given
by $F_\theta(X_t \mid \sigma(X_1^{t-1}))$, with the convention that $t = 1$
gives the marginal distribution of the first observation. Abbreviate this by
$F_\theta^t$. Similarly let $P^t = P(X_t \mid \sigma(X_1^{t-1}))$; this is the
best probabilistic prediction we could make, did we but know $P$ [39]. The
posterior predictive distribution is given by mixing the individual predictive
distributions with weights given by the posterior:

$$F_\Pi^t = \int_\Theta F_\theta^t\, d\Pi_t(\theta)$$


Remark on the topology of $\Theta$ and on $\mathcal{T}$. The hope in studying
posterior convergence is to show that, as $t$ grows, with higher and higher
($P$) probability, $\Pi_t$ concentrates more and more on sets which come closer
and closer to $P$. The tricky part here is "closer and closer": points in
$\Theta$ represent infinite-dimensional stochastic process distributions, and
the topology of such spaces is somewhat odd, and irritatingly abrupt, at least
under the more common distances. Any two ergodic measures are either equal or
have completely disjoint supports [31], so that the Kullback-Leibler divergence
between distinct ergodic processes is always infinity (in both directions), and
the total variation and Hellinger distances are likewise maximal. Most previous
work on posterior consistency has restricted itself to models where the
infinite-dimensional process distributions are formed by products of
fixed-dimensional base distributions (IID, Markov, etc.), and in effect
transferred the usual metrics' topologies from these finite-dimensional
distributions to the processes. It is possible to define metrics for general
stochastic processes [SI], and if readers like they may imagine that
$\mathcal{T}$ is a Borel $\sigma$-field under some such metric. This is not
necessary for the results presented here, however.



2.1. Example



The following example will be used to illustrate the assumptions (§2.2.1 and
Appendix B), and, later, the conclusions (§3.6).

Fig 1. State-transition diagram for the "even process". The legends on the
transition arrows indicate the probability of making the transition, and the
observation which occurs when the transition happens. The observation
$X_t = 1$ when entering or leaving state 1, otherwise it is 0. This creates
blocks of 1s of even length, separated by blocks of 0s of arbitrary length. The
result is a finite-state process which is not a Markov chain of any order.

The data-generating process $P$ is a stationary and ergodic measure on the
space of binary sequences, i.e., $\Xi = \{0, 1\}$, and the $\sigma$-field
$\mathcal{X} = 2^\Xi$. The measure is naturally represented as a function of a
two-state Markov chain $S_1^\infty$, with $S_t \in \{1, 2\}$. The transition
matrix is

$$T = \begin{bmatrix} 0.0 & 1.0 \\ 0.5 & 0.5 \end{bmatrix}$$

so that the invariant distribution puts probability 1/3 on state 1 and
probability 2/3 on state 2; take $S_1$ to be distributed accordingly. The
observed process is a binary function of the latent state transitions,
$X_t = 0$ if $S_t = S_{t+1} = 2$ and $X_t = 1$ otherwise. Figure 1 depicts the
transition and observation structure. Qualitatively, $X_1^\infty$ consists of
blocks of 1s of even length, separated by blocks of 0s of arbitrary length.
Since the joint process $\{(S_t, X_t)\}_{1 \le t < \infty}$ is a stationary and
ergodic Markov chain, $X_1^\infty$ is also stationary, ergodic and mixing.
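The construction of the even process is short enough to simulate directly. The sketch below is only an illustration under the description just given (the function names and the use of Python's `random.Random` are my own choices, not anything from the paper): it runs the two-state chain, emits 0 only on a transition from state 2 to state 2, and then collects the lengths of the blocks of 1s that are bounded by 0s on both sides.

```python
import random

def simulate_even_process(n, seed=0):
    """Draw x_1, ..., x_n from the even process via its two-state representation."""
    rng = random.Random(seed)
    # S_1 is drawn from the invariant distribution (1/3 on state 1, 2/3 on state 2).
    state = 1 if rng.random() < 1.0 / 3.0 else 2
    xs = []
    for _ in range(n):
        if state == 1:
            nxt = 2                                  # state 1 always moves to state 2
        else:
            nxt = 1 if rng.random() < 0.5 else 2     # state 2 moves to 1 or stays, each w.p. 0.5
        # X_t = 0 only when S_t = S_{t+1} = 2; otherwise X_t = 1.
        xs.append(0 if (state == 2 and nxt == 2) else 1)
        state = nxt
    return xs

def interior_one_block_lengths(xs):
    """Lengths of maximal runs of 1s that have a 0 on both sides.

    The first and last runs are skipped because the sample may start or stop
    in the middle of a block.
    """
    lengths = []
    run, seen_zero = 0, False
    for x in xs:
        if x == 1:
            run += 1
        else:
            if seen_zero and run > 0:
                lengths.append(run)
            run, seen_zero = 0, True
    return lengths

xs = simulate_even_process(10000, seed=0)
lengths = interior_one_block_lengths(xs)
```

Every entry of `lengths` should be even, whatever the seed, because a 1 emitted on entering state 1 is always followed by the 1 emitted on leaving it.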

This stochastic process comes from symbolic dynamics [33] [37], where it is
known as the "even process", and serves as a basic example of the class of
sofic processes [66], which have finite Markovian representations, as in
Figure 1, but are not Markov at any finite order. (If
$X_t = 1, X_{t-1} = 1, \ldots, X_{t-k} = 1$ for any finite $k$, the
corresponding $S_{t-i}$ must have alternated between one and two, but whether
$S_t$ is one or two, and thus the distribution of $X_{t+1}$, cannot be
determined from the length-$k$ history alone.) More exactly [35], sofic systems
or "finitary measures" are ones which are images of Markov chains under factor
maps, and strictly sofic systems, such as the even process, are sofic systems
which are not themselves Markov chains of any order. Despite their simplicity,
these models arise naturally when studying the time series of chaotic dynamical
systems [3] Q1)J [57] [IB], as well as problems in statistical mechanics [50]
and crystallography [62].

Let $\Theta_k$ be the space of all binary Markov chains of order $k$ with
strictly positive transition probabilities and their respective stationary
distributions; each $\Theta_k$ has dimension $2^k$. (Allowing some transition
probabilities to be zero creates uninteresting technical difficulties.) Since
each hypothesis is equivalent to a function $\Xi^{k+1} \mapsto (0, 1]$, we can
give $\Theta_k$ the topology of pointwise convergence of functions, and the
corresponding Borel $\sigma$-field. We will take
$\Theta = \bigcup_{k=1}^{\infty} \Theta_k$, identifying $\Theta_k$ with the
appropriate subset of $\Theta_{k+1}$. Thus $\Theta$ consists of all
strictly-positive stationary binary Markov chains, of whatever order, and is
infinite-dimensional.



As for the prior $\Pi_0$, it will be specified in more detail below (§2.2.1).
At the very least, however, it needs to have the "Kullback-Leibler rate
property", i.e., to give positive probability to every $\epsilon$
"neighborhood" $N_\epsilon(\theta)$ around every $\theta \in \Theta$, i.e., the
set of hypotheses whose Kullback-Leibler divergence from $\theta$ grows no
faster than $\epsilon$:

$$N_\epsilon(\theta) = \left\{ \theta' : \epsilon \ge \lim_{t \to \infty} \frac{1}{t} \int dx_1^t\, f_\theta(x_1^t) \log \frac{f_\theta(x_1^t)}{f_{\theta'}(x_1^t)} \right\}$$

(The limit exists for all $\theta, \theta'$ combinations [32].)

This example is simple, but it is also beyond the scope of existing work on
Bayesian convergence in several ways. First, the data-generating process $P$ is
not even Markov. Second, $P \notin \Theta$, so all the hypotheses are wrong,
and the truth is certainly not in the support of the prior. ($P$ can however be
approximated arbitrarily closely, in various process metrics, by distributions
from $\Theta$.) Third, because $P$ is ergodic, and ergodic distributions are
extreme points in the space of stationary distributions [20], it cannot be
represented as a mixture of distributions in $\Theta$. This means that the
Doob-style theorem of Ref. |H] does not apply, and even the subjective
certainty of convergence is not assured. The results of Refs. [38] [68] [6] on
mis-specified models do not hold because the data are dependent. To be as
concrete and explicit as possible, the analysis here will focus on the even
process, but only the constants would change if $P$ were any other strictly
sofic process. Much of it would apply even if $P$ were a stochastic
context-free language or pushdown automaton [12], where in effect the number of
hidden states is infinite, though some of the details in Appendix B would
change.

Ref. [47] describes a non-parametric procedure which will adaptively learn to
predict a class of discrete stochastic processes which includes the even process.
Ref. [5H] introduces a frequentist algorithm which consistently reconstructs the
hidden-state representation of sofic processes, including the even process. Ref.
[61] considers Bayesian estimation of the even process, using Dirichlet priors for
finite-order Markov chains, and employing Bayes factors to decide which order
of chain to use for prediction.



2.2. Assumptions 

The needed assumptions have to do with the dynamical properties of the data
generating process $P$, and with how well the dynamics meshes both with the
class of hypotheses $\Theta$ and with the prior distribution $\Pi_0$ over those
hypotheses.

Assumption 1 The likelihood ratio $R_t(\theta)$ is
$\sigma(X_1^t) \times \mathcal{T}$-measurable for all $t$.

The next two assumptions actually need only hold for $\Pi_0$-almost-all
$\theta$. But this adds more measure-0 caveats to the results, and it is hard
to find a natural example where it would help.
Assumption 2 For every $\theta \in \Theta$, the Kullback-Leibler divergence
rate from $P$,

$$h(\theta) = \lim_{t \to \infty} \frac{1}{t} E\left[ \log \frac{p(X_1^t)}{f_\theta(X_1^t)} \right]$$

exists (possibly being infinite) and is $\mathcal{T}$-measurable.

As mentioned, any two distinct ergodic measures are mutually singular, so
there is a consistent test which can separate them. ([53] constructs an
explicit but not necessarily optimal test.) One interpretation of the
divergence rate [32] is that it measures the maximum exponential rate at which
the power of such tests approaches 1, with $d = 0$ and $d = \infty$ indicating
sub- and supra-exponential convergence, respectively.

Assumption 3 For each $\theta \in \Theta$, the generalized or relative
asymptotic equipartition property holds, and so

$$\lim_{t \to \infty} \frac{1}{t} \log R_t(\theta) = -h(\theta) \qquad (1)$$

with $P$-probability 1.

Refs. [U [32] give sufficient, but not necessary, conditions for Assumption 3
to hold for a given $\theta$. The ordinary, non-relative asymptotic
equipartition property, also known as the Shannon-McMillan-Breiman theorem, is
that $\lim t^{-1} \log p(X_1^t) = -h_P$ a.s., where $h_P$ is the entropy rate
of the data-generating process. (See [32].) If this holds and $h_P$ is finite,
one could rephrase Assumption 3 as
$\lim t^{-1} \log f_\theta(X_1^t) = -h_P - h(\theta)$ a.s., and state results
in terms of the likelihood rather than the likelihood ratio. (Cf. [24, ch. 5].)
However, there are otherwise-well-behaved processes for which
$h_P = -\infty$, at least in the usual choice of reference measure, so I will
restrict myself to likelihood ratios.
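For the even-process example, the limit in Assumption 3 can be watched numerically. The sketch below is an illustration under assumptions of my own, not a procedure from the paper: it simulates the even process from its two-state representation, evaluates $\log p(x_1^t)$ with the standard forward recursion over the hidden states, evaluates $\log f_\theta(x_1^t)$ for one hypothetical first-order Markov chain (the transition table `q` and initial distribution `init` are illustrative choices), and reports $-t^{-1} \log R_t(\theta)$ in nats.

```python
import math
import random

def simulate_even_process(n, seed=0):
    """Draw x_1, ..., x_n from the even process via its two-state representation."""
    rng = random.Random(seed)
    state = 1 if rng.random() < 1.0 / 3.0 else 2    # invariant distribution of S_1
    xs = []
    for _ in range(n):
        nxt = 2 if state == 1 else (1 if rng.random() < 0.5 else 2)
        xs.append(0 if (state == 2 and nxt == 2) else 1)
        state = nxt
    return xs

def even_process_loglik(xs):
    """log p(x_1^t) by the forward recursion over hidden states {1, 2}."""
    a1, a2 = 1.0 / 3.0, 2.0 / 3.0                   # P(S_1 = 1), P(S_1 = 2)
    loglik = 0.0
    for x in xs:
        if x == 0:
            a1, a2 = 0.0, 0.5 * a2                  # only the 2 -> 2 transition emits 0
        else:
            a1, a2 = 0.5 * a2, 1.0 * a1             # 2 -> 1 and 1 -> 2 both emit 1
        norm = a1 + a2                              # P(x_t | x_1^{t-1})
        loglik += math.log(norm)
        a1, a2 = a1 / norm, a2 / norm               # renormalize to avoid underflow
    return loglik

def markov1_loglik(xs, q, init):
    """log f_theta(x_1^t) for a first-order chain with q[a][b] = P(X_{t+1}=b | X_t=a)."""
    loglik = math.log(init[xs[0]])
    for prev, cur in zip(xs, xs[1:]):
        loglik += math.log(q[prev][cur])
    return loglik

def divergence_rate_estimate(xs, q, init):
    """-(1/t) log R_t(theta) = (1/t) [log p(x_1^t) - log f_theta(x_1^t)]."""
    return (even_process_loglik(xs) - markov1_loglik(xs, q, init)) / len(xs)

# Hypothetical first-order hypothesis with strictly positive transition probabilities.
q = [[0.5, 0.5], [0.25, 0.75]]
init = [1.0 / 3.0, 2.0 / 3.0]
xs = simulate_even_process(20000, seed=1)
estimate = divergence_rate_estimate(xs, q, init)
```

If Assumption 3 holds for this hypothesis, `estimate` should settle near a strictly positive constant as the sample grows, since no first-order chain reproduces the even process exactly. The forward recursion is only valid on sequences the even process can actually generate; on an impossible sequence the normalizer would be zero.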

The meaning of Assumption 3 is that, relative to the true distribution, the
likelihood of each $\theta$ goes to zero exponentially, the rate being the
Kullback-Leibler divergence rate. Roughly speaking, an integral of
exponentially-shrinking quantities will tend to be dominated by the integrand
with the slowest rate of decay. This suggests that the posterior probability of
a set $A \subseteq \Theta$ depends on the smallest divergence rate which can be
attained at a point of prior support within $A$. Thus, adapting notation from
large deviations theory, define

$$h(A) = \operatorname{ess\,inf}_{\theta \in A} h(\theta)$$
$$J(\theta) = h(\theta) - h(\Theta)$$
$$J(A) = \operatorname{ess\,inf}_{\theta \in A} J(\theta)$$

where here and throughout ess inf is the essential infimum with respect to
$\Pi_0$, i.e., the greatest lower bound which holds with $\Pi_0$-probability 1.

Our further assumptions are those needed for the "roughly speaking" and
"should" statements of the previous paragraph to be true, so that, for
reasonable sets $A \in \mathcal{T}$,

$$\lim_{t \to \infty} \frac{1}{t} \log \Pi_0(A R_t) = -h(A)$$
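The heuristic behind this limit can be written out in one line. This is only a sketch of the "roughly speaking" argument, treating the convergence in Assumption 3 as if it held uniformly over $A$ and writing $\approx$ for agreement to first order in the exponent; making those steps rigorous is what the remaining assumptions are for.

```latex
\Pi_0(A R_t) = \int_A R_t(\theta)\, d\Pi_0(\theta)
  \approx \int_A e^{-t h(\theta)}\, d\Pi_0(\theta)
  \approx e^{-t h(A)},
\qquad
\Pi_t(A) = \frac{\Pi_0(A R_t)}{\Pi_0(R_t)}
  \approx \frac{e^{-t h(A)}}{e^{-t h(\Theta)}}
  = e^{-t J(A)}
```

On this reading, a set $A$ keeps non-vanishing posterior mass only if $J(A) = 0$, i.e., only if $h(A) = h(\Theta)$.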



C. R. Shalizi/ 'Dynamics of Bayesian Updating 






Let $I = \{\theta : h(\theta) = \infty\}$.
Assumption 4 $\Pi_0(I) < 1$.

If this assumption fails, then every hypothesis in the support of the prior 
doesn't just diverge from the true data-generating distribution, it diverges so 
rapidly that the error rate of a test against the latter distribution goes to zero 
faster than any exponential. (One way this can happen is if every hypothesis 
has a finite-dimensional distribution assigning probability zero to some event of 
positive P-probability.) The methods of this paper seem to be of no use in the 
face of such extreme mis-specification. 

Our first substantial assumption is that the prior distribution does not give
too much weight to parts of $\Theta$ where the log likelihood converges badly.

Assumption 5 There exists a sequence of sets $G_t \to \Theta$ such that

1. $\Pi_0(G_t) > 1 - \alpha \exp\{-t\beta\}$, for some $\alpha > 0$, $\beta > 2h(\Theta)$;

2. The convergence of Eq. (1) is uniform in $\theta$ over $G_t \setminus I$;

3. $h(G_t) \to h(\Theta)$.

Comment 1: An analogy with the method of sieves [25] may clarify the meaning
of the assumption. If we were constrained to some fixed $G$, the uniform
convergence in the second part of the assumption would make the convergence
of the posterior distribution fairly straightforward. Now imagine that the
constraint set is gradually relaxed, so that at time $t$ the posterior is
confined to $G_t$, which grows so slowly that convergence is preserved.
(Assumption 6 below is, in essence, about the relaxation being sufficiently
slow.) The theorems work by showing that the behavior of the posterior
distribution on the full space $\Theta$ is dominated by its behavior on this
"sieve".

Comment 2: Recall that by Egorov's theorem [35, Lemma 1.36, p. 18], if a
sequence of finite, measurable functions $f_t(\theta)$ converges pointwise to a
finite, measurable function $f(\theta)$ for $\Pi_0$-almost-all
$\theta \in \Theta$, then for each $\epsilon > 0$, there is a (possibly empty)
$B \subset \Theta$ such that $\Pi_0(B) \le \epsilon$, and the convergence is
uniform on $\Theta \setminus B$. Thus the first two parts of the assumption
really follow for free from the measurability in $\theta$ of likelihoods and
divergence rates. (That $\beta$ needs to be at least $2h(\Theta)$ becomes
apparent in the proof of Lemma 5, but that could always be arranged.) The extra
content comes in the third part of the assumption, which could fail if the
lowest-divergence hypotheses were also the ones where the convergence was
slowest, consistently falling into the bad sets $B$ allowed by Egorov's
theorem.

For each measurable $A \subseteq \Theta$, for every $\delta > 0$, there exists
a random natural number $\tau(A, \delta)$ such that

$$t^{-1} \log \Pi_0(A R_t) \le \delta + \limsup_{t} t^{-1} \log \Pi_0(A R_t)$$

for all $t > \tau(A, \delta)$, provided the limsup is finite. We need this
random last-entry time $\tau(A, \delta)$ to state the next assumption.
Assumption 6 The sets $G_t$ of the previous assumption can be chosen so that,
for every $\delta$, the inequality $t > \tau(G_t, \delta)$ holds a.s. for all
sufficiently large $t$.

The fraction of the prior probability mass outside of $G_t$ is exponentially
small in $t$, with the decay rate large enough that (Lemma 5) the posterior
probability mass outside $G_t$ also goes to zero. Using the analogy to the
sieve again, the meaning of the assumption is that the convergence of the
log-likelihood ratio is sufficiently fast, and the relaxation of the sieve is
sufficiently slow, that, at least eventually, every set $G_t$ has
$\delta$-converged by $t$, the time when we start using it.

To show convergence of the posterior measure, we need to be able to control 
the convergence of the log-likelihood on sets smaller than the whole parameter 
space. 

Assumption 7 The sets $G_t$ of the previous two assumptions can be chosen so
that, for any set $A$ with $\Pi_0(A) > 0$, $h(G_t \cap A) \to h(A)$.

Assumption 7 could be replaced by the logically-weaker assumption that for
each set $A$, there exists a sequence of sets $G_{t,A}$ satisfying the
equivalents of Assumptions 5 and 6 for the prior measure restricted to $A$.
Since the most straightforward way to check such an assumption would be to
verify Assumption 7 as stated, the extra generality does not seem worth it.

2.2.1. Verification of Assumptions for the Example 

Since every $\theta \in \Theta$ is a finite-order Markov chain, and $P$ is
stationary and ergodic, Assumption 1 is unproblematic, while Assumptions 2 and
3 hold by virtue of

HI- 

It is easy to check that $\inf_{\theta \in \Theta_k} h(\theta) > 0$ for each
$k$. (The infimum is not in general attained by any $\theta \in \Theta_k$,
though it could be if the chains were allowed to have some transition
probabilities equal to zero.) The infimum over $\Theta$ as a whole, however, is
zero. Also, $h(\theta) < \infty$ everywhere (because none of the hypotheses'
transition probabilities are zero), so the possible set $I$ of $\theta$ with
infinite divergence rates is empty, disposing of Assumption 4.

Verifying the remaining assumptions means building a sequence $G_t$ of
increasing subsets of $\Theta$ on which the convergence of $t^{-1} \log R_t$ is
uniform and sufficiently rapid, and ensuring that the prior probability of
these sets grows fast enough. This will be done by exploiting some
finite-sample deviation bounds for the even process, which in turn rest on its
mixing properties and the method
of types. Details are referred to Appendix B. The upshot is that the sets $G_t$
consist of chains whose order is less than or equal to
$\frac{\log t}{2/3 + \epsilon} - 1$, for some $\epsilon > 0$, and where the
absolute logarithm of all the transition probabilities is bounded by
$C t^\gamma$, where the positive constant $C$ is arbitrary but
$0 < \gamma < \frac{\epsilon/2}{2/3 + \epsilon}$. (With a different strictly
sofic process $P$, the constant 2/3 in the preceding expressions should be
replaced by $h_P$.) The exponential rate $\beta > 0$ for the prior probability
of $G_t^c$ can be chosen to be arbitrarily small.
3. Results



I first give the theorems here, without proof. The proofs, in §§3.1-3.5, are
accompanied by re-statements of the theorems, for the reader's convenience.

There are six theorems. The first upper-bounds the growth rate of the
posterior density at a given point $\theta$ in $\Theta$. The second matches the
upper bound on the posterior density with a lower bound, together providing the
growth-rate for the posterior density. The third is that $\Pi_t(A) \to 0$ for
any set with $J(A) > 0$, showing that the posterior concentrates on the
divergence-minimizing part of the hypothesis space. The fourth is a kind of
large deviations principle for the posterior measure. The fifth bounds the
asymptotic Hellinger and total variation distances between the posterior
predictive distribution and the actual conditional distribution of the next
observation. Finally, the sixth theorem establishes rates of convergence.

The first result uses only Assumptions 1-3. (It is not very interesting,
however, unless Assumption 4 is also true.) The latter three, however, all
depend on finer control of the integrated likelihood, and so finer control of
the prior, as embodied in Assumptions 5-6. More exactly, those additional
assumptions concern the interplay between the prior and the data-generating
process, restricting the amount of prior probability which can be given to
hypotheses whose log-likelihoods converge excessively slowly under $P$. I build
to the first result in the next sub-section, then turn to the control of the
integrated likelihood and its consequences in the next three sub-sections, and
then consider how these results apply to the example.

Theorem 1 Under Assumptions 1-3, with probability 1, for all $\theta$ where
$\pi_0(\theta) > 0$,

$$\limsup_{t \to \infty} \frac{1}{t} \log \pi_t(\theta) \le -J(\theta)$$

Theorem 2 Under Assumptions 1-6, for all $\theta \in \Theta$ where
$\pi_0(\theta) > 0$,

$$\lim_{t \to \infty} \frac{1}{t} \log \pi_t(\theta) = -J(\theta)$$

with probability 1.

Theorem 3 Make Assumptions 1-7. Pick any set $A \in \mathcal{T}$ where
$\Pi_0(A) > 0$ and $h(A) > h(\Theta)$. Then $\Pi_t(A) \to 0$ a.s.

Theorem 4 Under the conditions of Theorem 3, if $A \in \mathcal{T}$ is such
that

$$-\limsup t^{-1} \log \Pi_0(A \cap G_t^c) = \beta' \ge 2h(A)$$

then

$$\lim_{t \to \infty} \frac{1}{t} \log \Pi_t(A) = -J(A)$$

In particular, this holds whenever $2h(A) < \beta$ or
$A \subset \bigcap_{k=n}^{\infty} G_k$ for some $n$.
Theorem 5 Under Assumptions 1-7, with probability 1,

$$\limsup_{t \to \infty} \rho_H^2(P^t, F_\Pi^t) \le h(\Theta)$$

$$\limsup_{t \to \infty} \rho_{TV}^2(P^t, F_\Pi^t) \le 4h(\Theta)$$

where $\rho_H$ and $\rho_{TV}$ are, respectively, the Hellinger and total
variation metrics.

Theorem 6 Make Assumptions 1-7, and pick a positive sequence $\epsilon_t$ where
$\epsilon_t \to 0$, $t\epsilon_t \to \infty$. If, for each $\delta > 0$,

$$\tau(G_t \cap N_{\epsilon_t}^c, \delta) \le t$$

eventually almost surely, then 
with probability 1. 

3.1. Upper Bound on the Posterior Density 

The primary result of this section is a pointwise upper bound on the growth
rate of the posterior density. To establish it, I use some subsidiary lemmas,
which also recur in later proofs. Lemma 2 extends the almost-sure convergence
of the likelihood (Assumption 3) from holding pointwise in $\Theta$ to holding
simultaneously for all $\theta$ on a (possibly random) set of $\Pi_0$-measure
1. Lemma 3 shows that the prior-weighted likelihood ratio, $\Pi_0(R_t)$, tends
to be at least $\exp\{-t h(\Theta)\}$. (Both assertions are made more precise
in the lemmas themselves.)

I begin with a proposition about exchanging the order of universal quantifiers
(with almost-sure caveats).

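Lemma 1 Let $Q \subset \Theta \times \Omega$ be jointly measurable, with sections $Q_\theta = \{\omega : (\omega, \theta) \in Q\}$
and $Q_\omega = \{\theta : (\omega, \theta) \in Q\}$. If, for some probability measure $P$ on $\Omega$,

$$\forall \theta \quad P(Q_\theta) = 1 \tag{2}$$

then for any probability measure $\Pi$ on $\Theta$,

$$P\left(\{\omega : \Pi(Q_\omega) = 1\}\right) = 1 \tag{3}$$

In words, if, for all $\theta$, some property holds a.s., then a.s. the property holds
simultaneously for almost all $\theta$.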

Proof: Since $Q$ is measurable, for all $\omega$ and $\theta$, the sections are measurable,
and the measures of the sections, $P(Q_\theta)$ and $\Pi(Q_\omega)$, are measurable functions
of $\theta$ and $\omega$, respectively. Using Fubini's theorem,

$$\int_\Theta P(Q_\theta)\, d\Pi(\theta) = \int_\Theta \int_\Omega 1_Q(\omega, \theta)\, dP(\omega)\, d\Pi(\theta)$$
$$= \int_\Omega \int_\Theta 1_Q(\omega, \theta)\, d\Pi(\theta)\, dP(\omega)$$
$$= \int_\Omega \Pi(Q_\omega)\, dP(\omega)$$

By hypothesis, however, $P(Q_\theta) = 1$ for all $\theta$, so the left-hand side equals 1.
Since $\Pi(Q_\omega) \leq 1$ everywhere, it must be the case that
$\Pi(Q_\omega) = 1$ for $P$-almost-all $\omega$. (In fact, the set of $\omega$ for which this is true must
be a measurable set.) □

Lemma 2 Under Assumptions 1–3, there exists a set $C \subseteq \Xi^\infty$, with $P(C) = 1$,
where, for every $y \in C$, there exists a $Q_y \in \mathcal{T}$ such that, for every $\theta \in Q_y$, Eq.
1 holds. Moreover, $\Pi_0(Q_y) = 1$.

Proof: Let the set $Q$ consist of the $(\theta, \omega)$ pairs where Eq. 1 holds, i.e., for
which

$$\lim_{t\to\infty} \frac{1}{t}\log R_t(\theta, \omega) = -h(\theta),$$

being explicit about the dependence of the likelihood ratio on $\omega$. Assumption 3
states that $\forall \theta\ P(Q_\theta) = 1$, so applying Lemma 1 just needs the verification that
$Q$ is jointly measurable. But, by Assumptions 1 and 2, $h(\cdot)$ is $\mathcal{T}$-measurable, and
$R_t(\theta)$ is $\sigma(X_1^t) \times \mathcal{T}$-measurable for each $t$, so the set $Q$ where the convergence
holds is $\sigma(X_1^\infty) \times \mathcal{T}$-measurable. Everything then follows from the preceding
lemma. □

Remark: Lemma 2 generalizes Lemma 3 in [3]. Lemma 1 is a specialization
of the quantifier-reversal lemma used in [IS] to prove PAC-Bayesian theorems
for learning classifiers. Lemma 1 could be used to extend any of the results
below which hold a.s. for each $\theta$ to ones which a.s. hold simultaneously almost
everywhere in $\Theta$. This may seem too good to be true, like an alchemist's recipe
for turning the lead of pointwise limits into the gold of uniform convergence.
Fortunately or not, however, the lemma tells us nothing about the rate of convergence,
and is compatible with its varying across $\Theta$ from instantaneous to
arbitrarily slow, so uniform laws need stronger assumptions.

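Lemma 3 Under Assumptions 1–3, for every $\epsilon > 0$, it is almost sure that the
ratio between the integrated likelihood and the true probability density falls below
$\exp\{-t(h(\Theta) + \epsilon)\}$ only finitely often:

$$P\left\{x_1^\infty : \Pi_0(R_t) \leq \exp\{-t(h(\Theta) + \epsilon)\},\ \mathrm{i.o.}\right\} = 0 \tag{4}$$

and as a corollary, with probability 1,

$$\liminf_{t\to\infty} \frac{1}{t}\log \Pi_0(R_t) \geq -h(\Theta) \tag{5}$$

Proof: It's enough to show that Eq. 4 holds for all $x_1^\infty$ in the set $C$ from
the previous lemma, since that set has probability 1.

Let $N_{\epsilon/2}$ be the set of all $\theta$ in the support of $\Pi_0$ such that $h(\theta) \leq h(\Theta) + \epsilon/2$.
Since $x_1^\infty \in C$, the previous lemma tells us there exists a set $Q_{x_1^\infty}$ of $\theta$ for which
Eq. 1 holds under the sequence $x_1^\infty$.

$$\exp\{t(\epsilon + h(\Theta))\}\,\Pi_0(R_t) = \int_\Theta R_t(\theta) \exp\{t(\epsilon + h(\Theta))\}\, d\Pi_0(\theta)$$
$$\geq \int_{N_{\epsilon/2} \cap Q_{x_1^\infty}} R_t(\theta) \exp\{t(\epsilon + h(\Theta))\}\, d\Pi_0(\theta)$$
$$= \int_{N_{\epsilon/2} \cap Q_{x_1^\infty}} \exp\left\{t\left[\epsilon + h(\Theta) + \frac{1}{t}\log R_t(\theta)\right]\right\} d\Pi_0(\theta)$$

By Assumption 3,

$$\lim_{t\to\infty} \frac{1}{t}\log R_t(\theta) = -h(\theta)$$

and for all $\theta \in N_{\epsilon/2}$, $h(\theta) \leq h(\Theta) + \epsilon/2$, so

$$\liminf_{t\to\infty} \exp\left\{t\left[\epsilon + h(\Theta) + \frac{1}{t}\log R_t(\theta)\right]\right\} = \infty$$

a.s., for all $\theta \in N_{\epsilon/2} \cap Q_{x_1^\infty}$. We must have $\Pi_0(N_{\epsilon/2}) > 0$; otherwise $h(\Theta)$
would not be the essential infimum, and we know from the previous lemma that
$\Pi_0(Q_{x_1^\infty}) = 1$, so $\Pi_0(N_{\epsilon/2} \cap Q_{x_1^\infty}) > 0$. Thus, Fatou's lemma gives

$$\lim_{t\to\infty} \int_{N_{\epsilon/2} \cap Q_{x_1^\infty}} \exp\left\{t\left[\epsilon + h(\Theta) + \frac{1}{t}\log R_t(\theta)\right]\right\} d\Pi_0(\theta) = \infty$$

so

$$\lim_{t\to\infty} \exp\{t(\epsilon + h(\Theta))\}\,\Pi_0(R_t) = \infty$$

and hence

$$\Pi_0(R_t) > \exp\{-t(\epsilon + h(\Theta))\} \tag{6}$$

for all but finitely many $t$. Since this holds for all $x_1^\infty \in C$, and $P(C) = 1$,
Eq. 6 holds a.s., as was to be shown. The corollary statement follows
immediately. □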



Theorem 1 Under Assumptions 1–3, with probability 1, for all $\theta$ where $\pi_0(\theta) > 0$,

$$\limsup_{t\to\infty} \frac{1}{t}\log \pi_t(\theta) \leq -J(\theta) \tag{7}$$

Proof: As remarked,

$$\frac{1}{t}\log \pi_t(\theta) = \frac{1}{t}\log \pi_0(\theta) + \frac{1}{t}\log R_t(\theta) - \frac{1}{t}\log \Pi_0(R_t)$$

By Assumption 3, for each $\epsilon > 0$, it's almost sure that

$$\frac{1}{t}\log R_t(\theta) \leq -h(\theta) + \epsilon/2$$

for all sufficiently large $t$, while by Lemma 3, it's almost sure that

$$\frac{1}{t}\log \Pi_0(R_t) \geq -h(\Theta) - \epsilon/2$$

for all sufficiently large $t$. Hence, with probability 1,

$$\frac{1}{t}\log \pi_t(\theta) \leq h(\Theta) - h(\theta) + \epsilon + \frac{1}{t}\log \pi_0(\theta)$$

for all sufficiently large $t$. Hence

$$\limsup_{t\to\infty} \frac{1}{t}\log \pi_t(\theta) \leq h(\Theta) - h(\theta) = -J(\theta)$$

□

Lemma 3 gives a lower bound on the integrated likelihood ratio, showing that
in the long run it has to be at least as big as $\exp\{-t h(\Theta)\}$. (More precisely,
it is significantly smaller than that on vanishingly few occasions.) It does not,
however, rule out being larger. Ideally, we would be able to match this lower
bound with an upper bound of the same form, since $h(\Theta)$ is the best attainable
divergence rate, and, by Lemma 2, log likelihood ratios per unit time are converging
to divergence rates for $\Pi_0$-almost-all $\theta$, so values of $\theta$ for which $h(\theta)$ is
close to $h(\Theta)$ should come to dominate the integral in $\Pi_0(R_t)$. It would then
be fairly straightforward to show convergence of the posterior distribution.

Unfortunately, additional assumptions are required for such an upper bound,
because (as earlier remarked) Lemma 2 does not give uniform convergence,
merely universal convergence; with a large enough space of hypotheses, the
slowest pointwise convergence rates can be pushed arbitrarily low. For instance,
let $\overline{x_1^t}$ be the distribution on $\Xi^\infty$ which assigns probability 1 to endless repetitions
of $x_1^t$; clearly, under this distribution, seeing $X_1^t = x_1^t$ is almost certain. If
such measures fall within the support of $\Pi_0$, they will dominate the likelihood,
even though $h(\overline{x_1^t}) = \infty$ under all but very special circumstances (e.g., $P = \overline{x_1^t}$).
Generically, then, the likelihood and the posterior weight of $\overline{x_1^t}$ will rapidly
plummet at times $T > t$. To ensure convergence of the posterior, overly-flexible
measures like the family of $\overline{x_1^t}$'s must be either excluded from the support of
$\Pi_0$ (possibly because they are excluded from $\Theta$), or be assigned so little prior
weight that they do not end up dominating the integrated likelihood, or the
posterior must concentrate on them.

3.2. Convergence of Posterior Density via Control of the Integrated Likelihood

The next two lemmas tell us that sets in $\Theta$ of exponentially-small prior measure
make vanishingly small contributions to the integrated likelihood, and so to the
posterior. They do not require assumptions beyond those used so far, but their
application will.



Lemma 4 Make Assumptions 1–3, and choose a sequence of sets $B_t \subset \Theta$ such
that, for all sufficiently large $t$, $\Pi_0(B_t) \leq \alpha \exp\{-t\beta\}$ for some $\alpha, \beta > 0$. Then,
almost surely,

$$\Pi_0(R_t B_t) \leq \exp\{-t\beta/2\} \tag{8}$$

for all but finitely many $t$.

Proof: By Markov's inequality. First, use Fubini's theorem and the chain
rule for Radon-Nikodym derivatives to calculate the expectation value of the
ratio.

$$\mathbf{E}\left[\Pi_0(R_t B_t)\right] = \int_{\Xi^t} dP(x_1^t) \int_{B_t} d\Pi_0(\theta)\, R_t(\theta)$$
$$= \int_{B_t} d\Pi_0(\theta) \int_{\Xi^t} dP(x_1^t)\, \frac{dF_\theta}{dP}(x_1^t)$$
$$= \int_{B_t} d\Pi_0(\theta) \int_{\Xi^t} dF_\theta(x_1^t)$$
$$= \int_{B_t} d\Pi_0(\theta)$$
$$= \Pi_0(B_t)$$

Now apply Markov's inequality:

$$P\left\{x_1^t : \Pi_0(R_t B_t) > \exp\{-t\beta/2\}\right\} \leq \exp\{t\beta/2\}\, \mathbf{E}\left[\Pi_0(R_t B_t)\right]$$
$$= \exp\{t\beta/2\}\, \Pi_0(B_t)$$
$$\leq \alpha \exp\{-t\beta/2\}$$

for all sufficiently large $t$. Since these probabilities are summable, the Borel-Cantelli
lemma implies that, with probability 1, Eq. 8 holds for all but finitely
many $t$. □
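
The key identity in this proof, $\mathbf{E}[\Pi_0(R_t B_t)] = \Pi_0(B_t)$, can be checked exactly in a toy setting. The following is a minimal sketch, not part of the original argument: it assumes IID Bernoulli data and a three-point hypothesis space, and the names `bernoulli_likelihood` and `expected_restricted_integrated_ratio`, along with all numerical values, are hypothetical choices made only for illustration.

```python
from itertools import product
from math import prod, isclose

def bernoulli_likelihood(theta: float, xs: tuple) -> float:
    """Probability of the binary sequence xs under IID Bernoulli(theta)."""
    return prod(theta if x == 1 else 1.0 - theta for x in xs)

def expected_restricted_integrated_ratio(p_true, prior, subset, t):
    """Exact E_P[Pi_0(R_t B)], enumerating all 2^t binary sequences.

    R_t(theta) is the likelihood ratio f_theta(x) / p(x); the integrated
    ratio is restricted to the hypotheses listed in `subset` (the set B).
    """
    total = 0.0
    for xs in product((0, 1), repeat=t):
        p_x = bernoulli_likelihood(p_true, xs)
        integrated = sum(
            prior[th] * bernoulli_likelihood(th, xs) / p_x for th in subset
        )
        total += p_x * integrated  # weight by the true probability of xs
    return total

# Toy (assumed) setup: the truth 0.35 is not among the hypotheses.
prior = {0.2: 0.5, 0.5: 0.3, 0.9: 0.2}
B = [0.5, 0.9]
value = expected_restricted_integrated_ratio(p_true=0.35, prior=prior, subset=B, t=8)
prior_mass_of_B = sum(prior[th] for th in B)
print(value, prior_mass_of_B)
```

Up to floating-point error the two printed numbers agree, whatever the true parameter is, which is why the Markov bound depends on $B_t$ only through its prior mass.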

The next lemma asserts that a sequence of exponentially-small sets makes a
(logarithmically) negligible contribution to the posterior distribution, provided the
exponent is large enough compared to $h(\Theta)$.

Lemma 5 Let $B_t$ be as in the previous lemma. If $\beta > 2h(\Theta)$, then

$$\frac{\Pi_0(R_t B_t^c)}{\Pi_0(R_t)} \to 1 \tag{9}$$

Proof: Begin with the likelihood integrated over $B_t$ rather than its complement,
and apply Lemmas 3 and 4: for any $\epsilon > 0$,

$$\frac{\Pi_0(B_t R_t)}{\Pi_0(R_t)} \leq \frac{\exp\{-t\beta/2\}}{\exp\{-t[h(\Theta) + \epsilon]\}} \tag{10}$$
$$= \exp\{t[\epsilon + h(\Theta) - \beta/2]\} \tag{11}$$

provided $t$ is sufficiently large. If $\beta > 2h(\Theta)$, this bound can be made to go to
zero as $t \to \infty$ by taking $\epsilon$ to be sufficiently small. Since

$$\Pi_0(R_t) = \Pi_0(B_t^c R_t) + \Pi_0(B_t R_t)$$

it follows that

$$\frac{\Pi_0(B_t^c R_t)}{\Pi_0(R_t)} \to 1$$

□



Lemma 6 Make Assumptions 1–3, and take any set $G$ on which the convergence
in Eq. 1 is uniform and where $\Pi_0(G) > 0$. Then, $P$-a.s.,

$$\limsup_{t\to\infty} \frac{1}{t}\log \Pi_0(G R_t) \leq -h(G) \tag{12}$$

Proof: Pick any $\epsilon > 0$. By the hypothesis of uniform convergence, there
almost surely exists a $T(\epsilon)$ such that, for all $t \geq T(\epsilon)$ and for all $\theta \in G$,
$t^{-1}\log R_t(\theta) \leq -h(\theta) + \epsilon$. Hence

$$t^{-1}\log \Pi_0(G R_t) = t^{-1}\log \Pi_0(G \exp\{\log R_t\}) \tag{13}$$
$$\leq t^{-1}\log \Pi_0(G \exp\{t[-h + \epsilon]\}) \tag{14}$$
$$= \epsilon + t^{-1}\log \Pi_0(G \exp\{-th\}) \tag{15}$$

Let $\Pi_{0|G}$ denote the probability measure formed by conditioning $\Pi_0$ to be in
the set $G$. Then

$$\Pi_0(G z) = \Pi_0(G) \int_G d\Pi_{0|G}(\theta)\, z(\theta)$$

for any integrable function $z$. Apply this to the last term from Eq. 15:

$$\log \Pi_0(G \exp\{-th\}) = \log \Pi_0(G) + \log \int_G d\Pi_{0|G}(\theta) \exp\{-t h(\theta)\}$$

The second term on the right-hand side is the cumulant generating function of
$-h(\theta)$ with respect to $\Pi_{0|G}$, which turns out (cf. [6]) to have exactly the right
behavior as $t \to \infty$:

$$\frac{1}{t}\log \int_G d\Pi_{0|G}(\theta) \exp\{-t h(\theta)\} = \frac{1}{t}\log \int_G d\Pi_{0|G}(\theta) \left|\exp\{-h(\theta)\}\right|^t$$
$$= \frac{1}{t}\log \left[\left(\int_G d\Pi_{0|G}(\theta) \left|\exp\{-h(\theta)\}\right|^t\right)^{1/t}\right]^t$$
$$= \frac{1}{t}\left(t \log \left\|\exp\{-h(\theta)\}\right\|_{t, \Pi_{0|G}}\right)$$
$$= \log \left\|\exp\{-h(\theta)\}\right\|_{t, \Pi_{0|G}} \tag{16}$$

Since $h(\theta) \geq 0$, $\exp\{-h(\theta)\} \leq 1$, and the $L_p$ norm of the latter will grow
towards its $L_\infty$ norm as $p$ grows. Hence, for sufficiently large $t$,

$$\log \left\|\exp\{-h(\theta)\}\right\|_{t, \Pi_{0|G}} \leq \log \left\|\exp\{-h(\theta)\}\right\|_{\infty, \Pi_{0|G}} + \epsilon$$
$$= -\operatorname*{ess\,inf}_{\theta \in G} h(\theta) + \epsilon$$
$$= -h(G) + \epsilon \tag{17}$$

where the next-to-last step uses the monotonicity of log and exp.

Putting everything together, we have that, for any $\epsilon > 0$ and all sufficiently
large $t$,

$$t^{-1}\log \Pi_0(G R_t) \leq -h(G) + 2\epsilon + \frac{\log \Pi_0(G)}{t} \tag{18}$$

Hence the limit superior of the left-hand side is at most $-h(G)$. □
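
The step from Eq. 16 to Eq. 17, in which the scaled log-integral of $\exp\{-th\}$ approaches minus the essential infimum of $h$, can be seen numerically for a discrete conditional prior. This is a minimal sketch under assumed values: the weights, the divergence rates in `h_values`, and the function name `scaled_log_integral` are all hypothetical.

```python
from math import exp, log

def scaled_log_integral(weights, h_values, t):
    """(1/t) * log sum_i w_i * exp(-t * h_i).

    Computed stably by factoring out exp(-t * h_min), where h_min is the
    smallest divergence rate among hypotheses with positive weight.
    """
    support = [(w, h) for w, h in zip(weights, h_values) if w > 0]
    h_min = min(h for _, h in support)
    s = sum(w * exp(-t * (h - h_min)) for w, h in support)
    return -h_min + log(s) / t

# Assumed toy prior on three hypotheses; the best one has tiny prior weight.
weights = [0.01, 0.39, 0.60]
h_values = [0.3, 0.7, 1.5]
approximations = {t: scaled_log_integral(weights, h_values, t) for t in (1, 10, 100, 1000)}
for t, value in approximations.items():
    print(t, round(value, 4))
```

The values increase with $t$ toward $-0.3$, the negative of the smallest $h$ on the support, even though that hypothesis has only 1% of the prior mass; the prior weight only affects how quickly the limit is approached.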
Lemma 7 Under Assumptions 1–6,

$$\limsup_{t\to\infty} \frac{1}{t}\log \Pi_0(R_t) \leq -h(\Theta) \tag{19}$$

Proof: By Lemma 5,

$$\lim_{t\to\infty} \frac{\Pi_0(G_t R_t)}{\Pi_0(R_t)} = 1$$

implying that

$$\lim_{t\to\infty} \left[\log \Pi_0(R_t) - \log \Pi_0(G_t R_t)\right] = 0$$

so for every $\epsilon > 0$, for $t$ large enough

$$\log \Pi_0(R_t) \leq \epsilon/3 + \log \Pi_0(G_t R_t)$$

Consequently, again for large enough $t$,

$$\frac{1}{t}\log \Pi_0(R_t) \leq \frac{\epsilon}{3t} + \frac{1}{t}\log \Pi_0(G_t R_t)$$

Now, for each set $G$, for every $\epsilon > 0$, if $t \geq \tau(G, \epsilon/3)$ then

$$\frac{1}{t}\log \Pi_0(G R_t) \leq -h(G) + \epsilon/3$$

by Lemma 6. By Assumption 6, $t \geq \tau(G_t, \epsilon/3)$ for all sufficiently large $t$. Hence

$$\frac{1}{t}\log \Pi_0(R_t) \leq -h(G_t) + \frac{\epsilon}{3t} + \epsilon/3$$

for all $\epsilon > 0$ and all $t$ sufficiently large. Since, by Assumption 5, $h(G_t) \to h(\Theta)$,
for every $\epsilon > 0$, $h(G_t)$ is within $\epsilon/3$ of $h(\Theta)$ for large enough $t$, so

$$\frac{1}{t}\log \Pi_0(R_t) \leq -h(\Theta) + \frac{\epsilon}{3t} + \epsilon/3 + \epsilon/3$$

Thus, for every $\epsilon > 0$, we have that

$$\frac{1}{t}\log \Pi_0(R_t) \leq -h(\Theta) + \epsilon$$

for large enough $t$, or, in short,

$$\limsup_{t\to\infty} \frac{1}{t}\log \Pi_0(R_t) \leq -h(\Theta)$$

□

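Lemma 8 Under Assumptions 1–6, if $\Pi_0(I) = 0$, then

$$\frac{1}{t}\log \Pi_0(R_t) \to -h(\Theta) \tag{20}$$

almost surely.

Proof: Combining Lemmas 3 and 7,

$$-h(\Theta) \leq \liminf_{t\to\infty} \frac{1}{t}\log \Pi_0(R_t) \leq \limsup_{t\to\infty} \frac{1}{t}\log \Pi_0(R_t) \leq -h(\Theta)$$

□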

The standard version of Egorov's theorem concerns sequences of finite mea- 
surable functions converging pointwise to a finite measurable limiting function. 
However, the proof is easily adapted to an infinite limiting function. 

Lemma 9 Let $f_t(\theta)$ be a sequence of finite, measurable functions, converging
to $\infty$ almost everywhere ($\Pi_0$) on $I$. Then for each $\epsilon > 0$, there exists a possibly-empty
$B \subset I$ such that $\Pi_0(B) \leq \epsilon$, and the convergence is uniform on $I \setminus B$.

Proof: Parallel to the usual proof of Egorov's theorem. Begin by removing
the measure-zero set of points on which pointwise convergence fails; for simplicity,
keep the name $I$ for the remaining set. For each natural number $t$ and $k$, let
$B_{t,k} = \{\theta \in I : f_t(\theta) < k\}$, that is, the points where the function fails to be at least
$k$ by step $t$. Since the limit of $f_t$ is $\infty$ everywhere on $I$, each $\theta$ has a last $t$ such
that $f_t(\theta) < k$, no matter how big $k$ is. Hence $\bigcap_{t=1}^{\infty} B_{t,k} = \emptyset$. By continuity of
measure, for any $\delta > 0$, there exists an $n$ such that $\Pi_0(B_{t,k}) < \delta$ if $t \geq n$. Fix $\epsilon$
as in the statement of the lemma, and set $\delta = \epsilon 2^{-k}$. Finally, set $B = \bigcup_{k=1}^{\infty} B_{n,k}$.
By the union bound, $\Pi_0(B) \leq \epsilon$, and by construction, the rate of convergence
to $\infty$ is uniform on $I \setminus B$. □

Lemma 10 The conclusion of Lemma 8 is unchanged if $\Pi_0(I) > 0$.

Proof: The integrated likelihood ratio can be divided into two parts, one
from integrating over $I$ and one from integrating over its complement. Previous
lemmas have established that the latter is upper bounded, in the long run, by a
quantity which is $O(\exp\{-h(\Theta) t\})$. We can use Lemma 9 to divide $I$ into a sequence
of sub-sets, on which the convergence is uniform, and hence on which the
integrated likelihood shrinks faster than any exponential function, and remainder
sets, of prior measure no more than $\alpha \exp\{-n\beta\}$, on which the convergence
is less than uniform (i.e., slow). If we ensure that $\beta > 2h(\Theta)$, however, by
Lemma 5 the remainder sets' contributions to the integrated likelihood are negligible
in comparison to that of $\Theta \setminus I$. Said another way, if there are alternatives
which a consistent test would rule out at a merely exponential rate, those which
would be rejected at a supra-exponential rate end up making vanishingly small
contributions to the integrated likelihood. □

Theorem 2 Under Assumptions 1–6, for all $\theta \in \Theta$ where $\pi_0(\theta) > 0$,

$$\lim_{t\to\infty} \frac{1}{t}\log \pi_t(\theta) = -J(\theta) \tag{21}$$

with probability 1.

Proof: Theorem 1 says that, for all $\theta$,

$$\limsup_{t\to\infty} \frac{1}{t}\log \pi_t(\theta) \leq -J(\theta)$$

a.s., so there just needs to be a matching liminf. Pick any $\epsilon > 0$. By Assumption
3, it's almost certain that, for all sufficiently large $t$,

$$\frac{1}{t}\log R_t(\theta) \geq -h(\theta) - \epsilon/2$$

while by Lemma 10, it's almost certain that for all sufficiently large $t$,

$$\frac{1}{t}\log \Pi_0(R_t) \leq -h(\Theta) + \epsilon/2$$

Combining these as in the proof of Theorem 1, it's almost certain that for all
sufficiently large $t$

$$\frac{1}{t}\log \pi_t(\theta) \geq h(\Theta) - h(\theta) - \epsilon$$

so

$$\liminf_{t\to\infty} \frac{1}{t}\log \pi_t(\theta) \geq h(\Theta) - h(\theta) = -J(\theta)$$

□

3.3. Convergence and Large Deviations of the Posterior Measure

Adding Assumption 7 to those before it implies that the posterior measure
concentrates on sets $A \subseteq \Theta$ where $h(A) = h(\Theta)$.



Theorem 3 Make Assumptions 1–7. Pick any set $A \in \mathcal{T}$ where $\Pi_0(A) > 0$ and
$h(A) > h(\Theta)$. Then $\Pi_t(A) \to 0$ a.s.

Proof:

$$\Pi_t(A) = \Pi_t(A \cap G_t) + \Pi_t(A \cap G_t^c)$$
$$\leq \Pi_t(A \cap G_t) + \Pi_t(G_t^c)$$

The last term is easy to bound. From Eq. 11 in the proof of Lemma 5,

$$\Pi_t(G_t^c) = \frac{\Pi_0(R_t G_t^c)}{\Pi_0(R_t)} \leq \exp\{t[\epsilon + h(\Theta) - \beta/2]\} \tag{22}$$

for any $\epsilon > 0$, for all sufficiently large $t$, almost surely. Since $\beta > 2h(\Theta)$, the
whole expression $\to 0$ as $t \to \infty$.

To bound $\Pi_t(A \cap G_t)$, reasoning as in the proof of Lemma 7, but invoking
Assumption 7, leads to the conclusion that, for any $\epsilon > 0$, with probability 1,

$$\frac{1}{t}\log \Pi_0(R_t (A \cap G_t)) \leq -h(A) + \epsilon$$

for all sufficiently large $t$. Recall that by Lemma 3, for all $\epsilon > 0$ it's almost sure
that

$$\frac{1}{t}\log \Pi_0(R_t) \geq -h(\Theta) - \epsilon$$

for all sufficiently large $t$. Hence for every $\epsilon > 0$, it's almost certain that for all
sufficiently large $t$,

$$\Pi_t(A \cap G_t) \leq \exp\{t[h(\Theta) - h(A) + 2\epsilon]\} \tag{23}$$

Since $h(A) > h(\Theta)$, by picking $\epsilon$ small enough the right hand side goes to zero.
□
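
The growth rates in Theorems 1 to 3 can be illustrated by simulation in a deliberately simple mis-specified setting. The sketch below is an assumption-laden toy, not the example analyzed in this paper: the data are IID Bernoulli(0.35), the hypothesis space is the finite set $\{0.2, 0.5, 0.9\}$ with a uniform prior, so $h(\theta)$ reduces to a Kullback-Leibler divergence, and the helper names `kl_bernoulli` and `log_posterior` are hypothetical.

```python
import math
import random

def kl_bernoulli(p: float, q: float) -> float:
    """Kullback-Leibler divergence D(Bernoulli(p) || Bernoulli(q)), in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def log_posterior(prior: dict, xs: list) -> dict:
    """Log posterior weights from Bayes's rule for IID Bernoulli hypotheses."""
    ones = sum(xs)
    zeros = len(xs) - ones
    log_joint = {
        th: math.log(w) + ones * math.log(th) + zeros * math.log(1 - th)
        for th, w in prior.items()
    }
    # Log-sum-exp normalization, to avoid underflow at large t.
    m = max(log_joint.values())
    log_norm = m + math.log(sum(math.exp(v - m) for v in log_joint.values()))
    return {th: v - log_norm for th, v in log_joint.items()}

p_true = 0.35                            # assumed truth, outside the hypothesis set
prior = {0.2: 1 / 3, 0.5: 1 / 3, 0.9: 1 / 3}
t = 50_000
rng = random.Random(0)                   # fixed seed for reproducibility
xs = [1 if rng.random() < p_true else 0 for _ in range(t)]

h = {th: kl_bernoulli(p_true, th) for th in prior}   # divergence rates h(theta)
h_min = min(h.values())                              # plays the role of h(Theta)
J = {th: h[th] - h_min for th in prior}              # J(theta) = h(theta) - h(Theta)
rates = {th: lp / t for th, lp in log_posterior(prior, xs).items()}

for th in prior:
    print(th, round(rates[th], 4), round(-J[th], 4))
```

For each hypothesis the empirical rate $t^{-1}\log \pi_t(\theta)$ should lie close to $-J(\theta)$, so the set $\{0.2, 0.9\}$, where $h(A) > h(\Theta)$, loses its posterior mass exponentially fast even though every hypothesis is wrong.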

The proof of the theorem provides an exponential upper bound on the pos- 
terior measure of sets where h(A) > h(Q). In fact, even without the final as- 
sumption needed for the theorem, there is an exponential lower bound on that 
posterior measure. 

Lemma 11 Make Assumptions 1–6, and pick a set $A \in \mathcal{T}$ with $\Pi_0(A) > 0$.
Then

$$\liminf_{t\to\infty} \frac{1}{t}\log \Pi_t(A) \geq -J(A) \tag{24}$$

Proof: Reasoning as in the proof of Lemma 3, it is easy to see that

$$\liminf_{t\to\infty} \frac{1}{t}\log \Pi_0(R_t A) \geq -h(A)$$

and by Lemma 7,

$$\limsup_{t\to\infty} \frac{1}{t}\log \Pi_0(R_t) \leq -h(\Theta)$$

hence

$$\liminf_{t\to\infty} \frac{1}{t}\log \Pi_t(A) = \liminf_{t\to\infty} \frac{1}{t}\log \frac{\Pi_0(R_t A)}{\Pi_0(R_t)}$$
$$\geq -h(A) + h(\Theta)$$

□

Theorem 4 Under the conditions of Theorem 3, if $A \in \mathcal{T}$ is such that

$$-\limsup_{t\to\infty} t^{-1}\log \Pi_0(A \cap G_t^c) = \beta' \geq 2h(A) \tag{25}$$

then

$$\lim_{t\to\infty} \frac{1}{t}\log \Pi_t(A) = h(\Theta) - h(A) \tag{26}$$

In particular, this holds whenever $2h(A) < \beta$ or $A \subset \bigcap_{k=n}^{\infty} G_k$ for some $n$.

Proof: Trivially,

$$\frac{1}{t}\log \Pi_t(A) = \frac{1}{t}\log \left[\Pi_t(A \cap G_t) + \Pi_t(A \cap G_t^c)\right]$$

From Eq. 23 from the proof of Theorem 3, we know that, for any $\epsilon > 0$,

$$\Pi_t(A \cap G_t) \leq \exp\{t[h(\Theta) - h(A) + \epsilon]\}$$

a.s. for sufficiently large $t$. On the other hand, under the hypothesis of the
theorem, the proof of Eq. 22 can be imitated for $\Pi_t(A \cap G_t^c)$, with the conclusion
that, for all $\epsilon > 0$,

$$\Pi_t(A \cap G_t^c) \leq \exp\{t[h(\Theta) - \beta'/2 + \epsilon]\}$$

again a.s. for sufficiently large $t$. Since $\beta'/2 \geq h(A)$, and Lemma 11 supplies the
matching lower bound, Eq. 26 follows.

Finally, to see that this holds for any $A$ where $h(A) < \beta/2$, observe that we
can always upper bound $\Pi_t(A \cap G_t^c)$ by $\Pi_t(G_t^c)$, and the latter goes to zero with
rate at least $-\beta/2$. □

Remarks: Because $h(A)$ is the essential infimum of $h(\theta)$ on the set $A$, as
the set shrinks $h(A)$ grows. Sets where $h(A)$ is much larger than $h(\Theta)$ tend
accordingly to be small. The difficulty is that the sets $G_t^c$ are also small, and
conceivably overlap so heavily with $A$ that the integral of the likelihood over $A$ is
dominated by the part coming from $A \cap G_t^c$. Eventually this will shrink towards
zero exponentially, but perhaps only at the comparatively slow rate $h(\Theta) - \beta/2$,
rather than the faster rate $h(\Theta) - h(A)$ attained on the well-behaved part $A \cap G_t$.

Theorem 4 is close to, but not quite, a large deviations principle on $\Theta$. We
have shown that the posterior probability of any arbitrary set $A$ where $J(A) > 0$
goes to zero with an exponential rate at least equal to

$$\beta/2 \wedge \operatorname*{ess\,inf}_{\theta \in A} J(\theta) = \operatorname*{ess\,inf}_{\theta \in A}\, \beta/2 \wedge J(\theta) \tag{27}$$

But in a true LDP, the rate would have to be an infimum, not just an essential
infimum, of a point-wise rate function. This deficiency could be removed by
means of additional assumptions on $\Pi_0$ and $h(\theta)$.

Ref. [22] obtains proper large and even moderate deviations principles, but
for the location of $\Pi_t$ in the space $\mathcal{M}_1(\Theta)$ of all distributions on $\Theta$, rather
than on $\Theta$ itself. Essentially, they use the assumption of IID sampling, which
makes the posterior a function of the empirical distribution, to leverage the LDP
for the latter into an LDP for the former. This strategy may be more widely
applicable but goes beyond the scope of this paper. Papangelou [49], assuming
that $\Theta$ consists of discrete-valued Markov chains of arbitrary order and $P$ is in
the support of the prior, and using methods similar to those in Appendix B,
derives a result which is closely related to Theorem 4. In fact, fixing the sets $G_t$
as in Appendix B, Theorem 4 implies the theorem of [49].



3.4. Generalization Performance

Lemma 10 shows that, in hindsight, the Bayesian learner does a good job of
matching the data: the log integrated likelihood ratio per time-step approaches
$-h(\Theta)$, the limit of values attainable by individual hypotheses within the support
of the prior. This leaves open, however, the question of the prospective or
generalization performance.

What we want is for the posterior predictive distribution to approach the 
true conditional distribution of future events P , but we cannot in general hope 
for the convergence to be complete, since our models are mis-specified. The next 
theorem uses $h(\Theta)$ to put an upper bound on how far the posterior predictive
distribution can remain from the true predictive distribution. 

Theorem 5 Under Assumptions 1–7, with probability 1,

$$\limsup_{t\to\infty} \rho_H^2(P^t, F_\Pi^t) \leq h(\Theta) \tag{28}$$

$$\limsup_{t\to\infty} \rho_{TV}^2(P^t, F_\Pi^t) \leq 4h(\Theta) \tag{29}$$

where $\rho_H$ and $\rho_{TV}$ are, respectively, the Hellinger and total variation metrics.

Proof: Recall the well-known inequalities relating Hellinger distance to
Kullback-Leibler divergence on the one side and to total variation distance on
the other [30]: for any two distributions $P$ and $Q$,

$$\rho_H^2(P, Q) \leq D(P \| Q) \tag{30}$$
$$\rho_{TV}(P, Q) \leq 2\rho_H(P, Q) \tag{31}$$


C. R. Shalizi/ 'Dynamics of Bayesian Updating 






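It's enough to prove Eq. 28; Eq. 29 then follows from Eq. 31.

Abbreviate $\rho_H(P^t, F_\theta^t)$ by $\rho_H(t, \theta)$. Pick any $\epsilon > h(\Theta)$, and say that
$A_\epsilon = \{\theta : \rho_H^2(t, \theta) > \epsilon\}$. By convexity and Jensen's inequality,

$$\rho_H^2(P^t, F_\Pi^t) \leq \int_\Theta \rho_H^2(t, \theta)\, d\Pi_t(\theta)$$
$$= \int_{A_\epsilon^c} \rho_H^2(t, \theta)\, d\Pi_t(\theta) + \int_{A_\epsilon} \rho_H^2(t, \theta)\, d\Pi_t(\theta)$$
$$\leq \epsilon\, \Pi_t(A_\epsilon^c) + \sqrt{2}\, \Pi_t(A_\epsilon)$$

By Eq. 30, $d(\theta) \geq \rho_H^2(t, \theta)$. Thus $h(A_\epsilon) \geq \epsilon$, and $\epsilon > h(\Theta)$, so, by Theorem 3,
$\Pi_t(A_\epsilon) \to 0$ a.s. Hence, for any $\eta > 0$,

$$\rho_H^2(P^t, F_\Pi^t) \leq \epsilon + \eta$$

eventually almost surely. Since this holds for any $\epsilon > h(\Theta)$, Eq. 28 follows. □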
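
The two inequalities, Eqs. 30 and 31, depend on how the distances are normalized. The following is a minimal numerical check under one set of conventions that makes both hold, which is an assumption here rather than something the text fixes: $\rho_H^2(P,Q) = \sum_i (\sqrt{p_i} - \sqrt{q_i})^2$, $\rho_{TV}(P,Q) = \sum_i |p_i - q_i|$, and divergence in nats. The function names and the random test distributions are hypothetical.

```python
import math
import random

def hellinger_sq(p, q):
    """Squared Hellinger distance, unnormalized: sum (sqrt(p_i) - sqrt(q_i))^2."""
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))

def l1_distance(p, q):
    """Total variation taken as the L1 distance sum |p_i - q_i| (assumed convention)."""
    return sum(abs(a - b) for a, b in zip(p, q))

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in nats."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def random_distribution(rng, k):
    """A random probability vector with k strictly positive entries."""
    w = [rng.random() + 1e-3 for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

rng = random.Random(1)
pairs = [(random_distribution(rng, 5), random_distribution(rng, 5)) for _ in range(1000)]

# Eq. 30: squared Hellinger distance is at most the divergence.
eq30_holds = all(hellinger_sq(p, q) <= kl(p, q) + 1e-12 for p, q in pairs)
# Eq. 31: total variation is at most twice the Hellinger distance.
eq31_holds = all(
    l1_distance(p, q) <= 2 * math.sqrt(hellinger_sq(p, q)) + 1e-12 for p, q in pairs
)
print(eq30_holds, eq31_holds)
```

Squaring Eq. 31 and combining it with Eq. 28 is what produces the factor of 4 in Eq. 29.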

Remark: It seems like it should be possible to prove a similar result for the
divergence rate of the predictive distribution, namely that

$$\limsup_{t\to\infty} h(\Pi_t) \leq h(\Theta)$$

but it would take a different approach, because $h(\cdot)$ has no upper bound, and
the posterior weight of the high-divergence regions might decay too slowly to
compensate for this.



3.5. Rate of Convergence 

Recall that $N_\epsilon$ was defined as the set of all $\theta$ such that $h(\theta) \leq h(\Theta) + \epsilon$.
(This is measurable by Assumption 2.) The set $N_\epsilon^c$ thus consists of all hypotheses
whose divergence rate is more than $\epsilon$ above the essential infimum $h(\Theta)$.
For any $\epsilon > 0$, $\Pi_t(N_\epsilon^c) \to 0$ a.s., by Theorem 3, and for sufficiently small $\epsilon$,
$\lim_{t\to\infty} t^{-1}\log \Pi_t(N_\epsilon^c) = -\epsilon$ a.s., by Theorem 4. For such sets, in other words,
for any $\delta > 0$, it's almost certain that for all sufficiently large $t$,

$$\Pi_t(N_\epsilon^c) \leq \exp\{-t(\epsilon - \delta)\} \tag{32}$$

Now consider a non-increasing positive sequence $\epsilon_t \to 0$. Presumably if $\epsilon_t$ decays
slowly enough, $\Pi_t(N_{\epsilon_t}^c)$ will still go to zero, even though the sets $N_{\epsilon_t}^c$ are
non-decreasing. Examination of Eq. 32 suggests, naively, that this will work if $t\epsilon_t \to \infty$,
i.e., if the decay of $\epsilon_t$ is strictly sublinear. This is correct under an additional
condition, similar to Assumption 6.

Theorem 6 Make Assumptions 1–7, and pick a positive sequence $\epsilon_t$ where $\epsilon_t \to 0$,
$t\epsilon_t \to \infty$. If, for each $\delta > 0$,

$$\tau(G_t \cap N_{\epsilon_t}^c, \delta) \leq t \tag{33}$$

eventually almost surely, then

$$\Pi_t(N_{\epsilon_t}) \to 1 \tag{34}$$

with probability 1.

Proof: By showing that $\Pi_t(N_{\epsilon_t}^c) \to 0$ a.s. Begin by splitting the sets into
the parts inside the $G_t$, say $U_t$, and the parts outside:

$$\Pi_t(N_{\epsilon_t}^c) = \Pi_t(N_{\epsilon_t}^c \cap G_t) + \Pi_t(N_{\epsilon_t}^c \cap G_t^c)$$
$$\leq \Pi_t(U_t) + \Pi_t(G_t^c)$$

From Lemma 4, the second term $\to 0$ with probability 1, so for any $\eta_1 > 0$, it
is $\leq \eta_1$ eventually a.s.

Turning to the other term, Theorem 4 applies to $U_k$ for any fixed $k$, so

$$\lim_{t\to\infty} t^{-1}\log \Pi_t(U_k) = h(\Theta) - h(U_k)$$

(a.s.), implying, with Lemma 10, that

$$\lim_{t\to\infty} \frac{1}{t}\log \Pi_0(U_k R_t) = -h(U_k)$$

(a.s.). By Eq. 33, for any $\eta_2 > 0$,

$$t^{-1}\log \Pi_0(U_t R_t) \leq -h(U_t) + \eta_2$$

eventually almost surely. By Lemma 10 and Bayes's rule, then,

$$t^{-1}\log \Pi_t(U_t) \leq h(\Theta) - h(U_t) + \eta_3$$

eventually a.s., for any $\eta_3 > 0$. Putting things back together, eventually a.s.,

$$\Pi_t(N_{\epsilon_t}^c) \leq \exp\{t(h(\Theta) - h(U_t) + \eta_3)\} + \eta_1$$
$$\leq \exp\{t(-\epsilon_t + \eta_3)\} + \eta_1$$

Since $t\epsilon_t \to \infty$, the first term goes to zero, and since $\eta_1$ can be as small as
desired,

$$\Pi_t(N_{\epsilon_t}^c) \to 0$$

almost surely. □

The theorem lets us attain rates of convergence just slower than $t^{-1}$ (so
that $t\epsilon_t \to \infty$). This matches existing results on rates of posterior convergence
for mis-specified models with IID data in [68, Corollary 5.2] ($n^{-1}$ in the Rényi
divergence) and in [35] ($n^{-1/2}$ in the Hellinger distance; recall Eq. 30), and for
correctly-specified non-IID models ($t^{-\alpha}$ for suitable $\alpha < 1/2$, again in
the Hellinger distance).



3.6. Application of the Results to the Example

Because $h(\Theta) = 0$, while $h(\theta) > 0$ everywhere, the behavior of the posterior is
somewhat peculiar. Every compact set $K \subset \Theta$ has $J(K) > 0$, so by Theorem
3, $\Pi_t(K) \to 0$. On the other hand, $\Pi_t(G_t) \to 1$: the sequence of good sets
contains models of increasingly high order, with increasingly weak constraints on
the transition probabilities, and this lets its posterior weight grow, even though
every individual compact set within it ultimately loses all weight.

In fact, each $G_t$ is a convex set, and $h(\cdot)$ is a convex function, so there is a
unique minimizer of the divergence rate within each good set. Conditional on
being within $G_t$, the posterior probability becomes increasingly concentrated on
neighborhoods of this minimizer, but the minimizer itself keeps moving, since it
can always be improved upon by increasing the order of the chain and reducing
some transition probabilities. (Recall that $P$ gives probability 0 to sequences
010, 01110, etc., where the block of 1's is of odd length, but $\Theta$ contains only
chains with strictly positive transition probabilities.)

Outside of the good sets, the likelihood is peaked around hypotheses which
provide stationary and smooth approximations to the $\overline{x_1^t}$ distribution that endlessly
repeats the observed sequence to date. The divergence rates of these hypotheses
are however extremely high, so none of them retains its high likelihood
for very long. ($\overline{x_1^t}$ is a Markov chain of order $t$, but it is not in $\Theta$, since it's
neither stationary nor does it have strictly positive transition probabilities. It
can be made stationary, however, by assigning equal probability to each of its
$t$ states; this gives the data likelihood $1/t$ rather than 1, but that still is vastly
larger than the $O(-ct)$ log-likelihoods of better models. (Recall that even the
log-likelihood of the true distribution is only $O(-\frac{2}{3}t)$.) Allowing each of the $t$
states to have a probability $0 < \iota \ll 1$ of not proceeding to the next state in the
periodic sequence is easy and leads to only an $O(\iota t)$ reduction in the likelihood
up to time $t$. In the long run, however, it means that the log-likelihood will be
$O(t \log \iota)$.) In any case, the total posterior probability of $G_t^c$ is going to zero
exponentially.

Despite the fact (or rather, because of the fact) that no point in $\Theta$ is the ne plus
ultra around which the posterior concentrates, the conditions of Theorem 5 are
met, and since $h(\Theta) = 0$, the posterior predictive distribution converges to the
true predictive distribution in the Hellinger and total variation metrics. That is,
the weird gyrations of the posterior do not prevent us from attaining predictive
consistency. This is so even though the posterior always gives the wrong answer
to such basic questions as "Is $P(X_t^{t+2} = 010) > 0$?", inferences which in this
case can be made correctly through non-Bayesian methods [UJ 155].

4. Discussion 

The crucial assumptions were 3, 5 and 6. Together, these amount to assuming
that the time-averaged log likelihood ratio converges universally; to fashioning a
sieve, successively embracing regions of $\Theta$ where the convergence is increasingly
ill-behaved; and the hope that the prior weight of the remaining bad sets can
be bounded exponentially.

Using asymptotic equipartition in place of the law of large numbers is fairly 
straightforward. Both results belong to the general family of ergodic theorems, 
which allow us to take sufficiently long sample paths as representative of entire 
processes. The unique a.s. limit in Eq. 1 can be replaced with a.s. convergence to
a distinct limit in each ergodic component of $P$. However, the notation gets ugly,
so the reader should regard $h(\theta)$ as that random limit, and treat all subsequent
results as relative to the ergodic decomposition of $P$. (Cf. [3T] [T7].) It may
be possible to weaken this assumption yet further, but it is hard to see how
Bayesian updating can succeed if the past performance of the likelihood is not
a guide to future results.

A bigger departure from the usual approach to posterior convergence may
be allowing $h(\Theta) > 0$; this rules out posterior consistency, to begin with. More
subtly, it requires $\beta > 2h(\Theta)$. This means that a prior distribution which satisfies
the assumptions for one value of $P$ may not satisfy them for another, depending,
naturally enough, on just how mis-specified the hypotheses are, and how much
weight the prior puts on very bad hypotheses. On the other hand, when $h(\Theta) = 0$,
Theorem 5 implies predictive consistency, as in the example.

Assumption 6 is frankly annoying. It ensures that the log likelihood ratio
converges fast enough, at least on the good sets, that we can be confident that the
integrated likelihood of $G_t$ has converged well by the time we want $G_t$ to start
dominating the prior. It was shaped, however, to fill a hole in the proof of
Lemma 7 rather than by more natural considerations. The result is that verifying
the assumption in its present form means proving the sub-linear growth
rate of sequences of random last entry times, and these times are not generally
convenient to work with. (Cf. Appendix B.) It would be nice to replace
it with a bracketing or metric entropy condition, as in [U [55] or similar forms
of capacity control, as used in [35] IS3]. Alternately, the uniformly consistent
test conditions widely employed in Bayesian nonparametrics [35] [57] have been
adapted to the mis-specified setting by [35], where the tests become reminiscent of
the "model selection tests" used in econometrics [64]. Since the latter can work
for dynamical models [51], this approach may also work here. In any event, replacing
Assumption 6 with more primitive, comprehensible and easily-verified
conditions seems a promising direction for future work.

These results go some way toward providing a frequentist explanation of the
success of Bayesian methods in many practical problems. Under these conditions,
the posterior is increasingly weighted towards the parts of $\Theta$ which are
closest (in the Kullback-Leibler sense) to the data-generating process $P$. For
$\Pi_t(A)$ to persistently be much more or much less than $\approx \exp\{-t J(A)\}$, $R_t(\theta)$
must be persistently far from $\exp\{-t h(\theta)\}$, not just for isolated $\theta \in A$, but for a
whole positive-measure subset of them. With a reasonably smooth prior, this
requires a run of bad luck amounting almost to a conspiracy. From this point
of view, Bayesian inference amounts to introducing bias so as to reduce variance,
and then relaxing the bias. Experience with frequentist non-parametric
methods shows this can work if the bias is relaxed sufficiently slowly, which is
basically what the assumptions here do. As the example shows, this can succeed
as a predictive tactic without supporting substantive inferences about the
data-generating process. However, Assumptions 5–7 involve both the prior and the data-generating
process, and so cannot be verified using the prior alone. For empirical applications,
it would be nice to have ways of checking them using sample data.

When $h(\Theta) > 0$ and all the models are more or less wrong, there is an additional
advantage to averaging the models, as is done in the predictive distribution.
(I owe the argument which follows to Scott Page; cf. 05].) With a convex
loss function $\ell$, such as squared error, Kullback-Leibler divergence, Hellinger
distance, etc., the loss of the predictive distribution $\ell(\Pi_t)$ will be no larger than
the posterior-mean loss of the individual models $\Pi_t(\ell(\theta))$. For squared error
loss, the difference is equal to the variance of the models' predictions [30]. For
divergence, some algebra shows that



\[
h(\Pi_t) = \Pi_t(h(\theta)) + \Pi_t\left(\mathbf{E}\left[\log \frac{dF_\theta}{dF^{\Pi_t}}\right]\right) \tag{35}
\]



where the second term on the RHS is again an indication of the diversity of the 
models; the more different their predictions are, on the kind of data generated 
by $P$, the smaller the error made by the mixture. Having a diversity of
wrong answers can be as important as reducing the average error itself. The way 
to accomplish this is to give more weight to models which make mostly good 
predictions, but make different mistakes. This suggests that there may actually 
be predictive benefits to having the posterior concentrate on a set containing 
multiple hypotheses. 
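
A small numerical check may make this decomposition concrete. The sketch below is illustrative only: it uses ordinary one-step Kullback-Leibler divergences between discrete distributions rather than divergence rates, and the distribution `p`, the two `models` and the `weights` are hypothetical values, not taken from the example in the text.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D(p || q), in nats, for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mixture(models, weights):
    """Weighted average of the models' predictive distributions."""
    size = len(models[0])
    return [sum(w * m[i] for w, m in zip(weights, models)) for i in range(size)]

def decomposition(p, models, weights):
    """Return (divergence of the mixture, weighted-mean divergence, diversity term)."""
    mix = mixture(models, weights)
    mean_loss = sum(w * kl(p, m) for w, m in zip(weights, models))
    # Weighted average of E_p[log(model / mixture)]; non-positive by Jensen's inequality.
    diversity = sum(
        w * sum(pi * math.log(m[i] / mix[i]) for i, pi in enumerate(p) if pi > 0)
        for w, m in zip(weights, models)
    )
    return kl(p, mix), mean_loss, diversity

p = [0.5, 0.3, 0.2]                           # hypothetical data-generating distribution
models = [[0.7, 0.2, 0.1], [0.2, 0.3, 0.5]]   # two wrong models that err differently
weights = [0.5, 0.5]                          # hypothetical posterior weights
mix_loss, mean_loss, diversity = decomposition(p, models, weights)
```

Here `mix_loss` equals `mean_loss + diversity` up to rounding, and since the diversity term is never positive, the mixture's divergence cannot exceed the weighted average of the individual divergences.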

Finally, it is worth remarking on the connection between these results and 
prediction with "mixtures of experts" [5, 10]. Formally, the role of the negative
log-likelihood and of Bayes's rule in this paper was to provide a loss function and
a multiplicative scheme for updating the weights. All but one of the main results
(Theorem 5, which bounds Hellinger distance by Kullback-Leibler divergence)
would carry over to multiplicative weight training using a different loss function, 
provided the accumulated loss per unit time converged. 
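
The correspondence can be sketched in a few lines. This is a minimal illustration, not the paper's construction: the learning rate `eta` and all numerical values are assumptions introduced here. With `eta = 1` and the negative log-likelihood as the loss, the multiplicative update coincides with Bayes's rule; any other loss can be substituted in the same scheme.

```python
import math

def multiplicative_update(weights, losses, eta=1.0):
    """One multiplicative-weights step: w_i <- w_i * exp(-eta * loss_i), renormalized."""
    unnormalized = [w * math.exp(-eta * loss) for w, loss in zip(weights, losses)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

def bayes_update(weights, likelihoods):
    """One step of Bayes's rule on a finite set of hypotheses."""
    unnormalized = [w * lik for w, lik in zip(weights, likelihoods)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

prior = [0.5, 0.3, 0.2]            # hypothetical prior weights
likelihoods = [0.1, 0.4, 0.25]     # hypothetical likelihoods of one observation

# Negative log-likelihood as the loss reproduces the Bayesian posterior.
via_losses = multiplicative_update(prior, [-math.log(lik) for lik in likelihoods])
via_bayes = bayes_update(prior, likelihoods)

# The same scheme with a different loss, here hypothetical squared errors.
via_squared_error = multiplicative_update(prior, [0.04, 0.25, 1.0])
```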



Appendix A: Bayesian Updating as Replicator Dynamics 

Replicator dynamics are one of the fundamental models of evolutionary biology; 
they represent the effects of natural selection in large populations, without (in 
their simplest form) mutation, sex, or other sources of variation. [33] provides a 
thorough discussion. They also arise as approximations to many other adaptive 
processes, such as reinforcement learning [S] [2 [M]. In this appendix, I show that
Bayesian updating also follows the replicator equation. 

We have a set of replicators (phenotypes, species, reproductive strategies,
etc.) indexed by $\theta \in \Theta$. The population density at type $\theta$
is $\pi(\theta)$. We denote by $\phi_t(\theta)$ the fitness of $\theta$ at time
$t$, i.e., the average number of descendants left by each individual of type
$\theta$. The fitness function $\phi_t$ may in fact be a function of $\pi_t$, in
which case it is said to be frequency-dependent. Many applications assume
the fitness function to be deterministic, rather than random, and further assume
that it is not an explicit function of $t$, but these restrictions are inessential.

The discrete-time replicator dynamic [35] is the dynamical system given by
the map

\[
\pi_t(\theta) = \pi_{t-1}(\theta) \frac{\phi_t(\theta)}{\Pi_t(\phi_t)} \tag{36}
\]

where $\Pi_t(\phi_t)$ is the population mean fitness at $t$, i.e.,

\[
\Pi_t(\phi_t) = \int \phi_t(\theta) \, d\pi_t(\theta)
\]



The effect of these dynamics is to re-weight the population towards replicators 
with above-average fitness. 

It is immediate that Bayesian updating has the same form as Eq. 36 as soon 
as we identify the distribution of replicators with the posterior distribution, and 
the fitness with the conditional likelihood. In fact, Bayesian updating is an extra 
simple case of the replicator equation, since the fitness function is frequency- 
independent, though stochastic. Updating corresponds to the action of natural 
selection, without variation, in a fluctuating environment. The results in the 
main text assume (Assumption 3) that, despite the fluctuations, the long-run
fitness is nonetheless a determinate function of $\theta$. The theorems assert that
selection can then be relied upon to drive the population to the peaks of the
long-run fitness function, at the cost of reducing the diversity of the population,
rather as in Fisher's fundamental theorem of natural selection [23].
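
The identification can be checked numerically on a toy case. The following is a sketch under stated assumptions: three hypothetical Bernoulli models with IID data (so the conditional likelihood of each observation does not depend on the past), a uniform prior, and a made-up data sequence. The mean fitness is computed under the density being updated, which is what makes the new weights sum to one.

```python
def replicator_step(density, fitness):
    """Re-weight each type by its fitness relative to the population mean fitness."""
    mean_fitness = sum(d * f for d, f in zip(density, fitness))
    return [d * f / mean_fitness for d, f in zip(density, fitness)]

def bernoulli_likelihood(theta, x):
    """Probability of the binary observation x under success probability theta."""
    return theta if x == 1 else 1.0 - theta

thetas = [0.2, 0.5, 0.8]            # hypothetical hypotheses
prior = [1 / 3, 1 / 3, 1 / 3]       # hypothetical uniform prior
data = [1, 1, 0, 1, 1, 1, 0, 1]     # hypothetical observations

# Sequential updating: one replicator step per observation, fitness = likelihood.
posterior = list(prior)
for x in data:
    posterior = replicator_step(posterior, [bernoulli_likelihood(th, x) for th in thetas])

# Batch Bayes for comparison: prior times the full-sample likelihood, normalized.
heads = sum(data)
tails = len(data) - heads
unnormalized = [w * th**heads * (1 - th)**tails for w, th in zip(prior, thetas)]
batch = [u / sum(unnormalized) for u in unnormalized]
```

The sequentially updated `posterior` agrees with `batch` up to rounding, and most of the weight ends up on the hypothesis closest to the empirical frequency of ones.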

Corollary 1 Define the relative fitness $\tilde{\phi}_t(\theta) = L_t(\theta)/\Pi_t(L_t)$.
Under the conditions of Theorem 2, the time average of the log relative fitness
converges a.s.:

\[
\frac{1}{t} \sum_{n=1}^{t} \log \tilde{\phi}_n(\theta) = -J(\theta) + o(1) \tag{37}
\]

Proof: Unrolling Bayes's rule over multiple observations,

\[
\pi_t(\theta) = \pi_0(\theta) \prod_{n=1}^{t} \tilde{\phi}_n(\theta)
\]

Take log of both sides, divide through by $t$, and invoke Theorem 2. □
Remark: Theorem 2 implies that

\[
H_t = \left| \log \pi_t(\theta) + tJ(\theta) \right|
\]

is a.s. $o(t)$. To strengthen Eq. 37 from convergence of the time average or Cesàro
mean to plain convergence requires forcing $H_t - H_{t-1}$ to be $o(1)$, which it
generally isn't.
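
The unrolling step in the proof can be verified numerically. The sketch below is an illustration with hypothetical likelihood values; it takes each relative fitness to be the likelihood divided by its mean under the current weights, and confirms that the time average of the log relative fitness equals the change in log weight divided by the number of observations.

```python
import math

def average_log_relative_fitness(prior, likelihood_rows):
    """Return the time-averaged log relative fitness per hypothesis and the final weights.

    likelihood_rows[n][i] is the conditional likelihood of observation n under hypothesis i.
    """
    density = list(prior)
    log_sums = [0.0] * len(prior)
    for row in likelihood_rows:
        mean_fitness = sum(d * lik for d, lik in zip(density, row))
        relative = [lik / mean_fitness for lik in row]
        log_sums = [s + math.log(r) for s, r in zip(log_sums, relative)]
        density = [d * r for d, r in zip(density, relative)]
    t = len(likelihood_rows)
    return [s / t for s in log_sums], density

prior = [0.5, 0.5]                             # hypothetical prior weights
rows = [[0.6, 0.3], [0.2, 0.5], [0.7, 0.4]]    # hypothetical conditional likelihoods
averages, posterior = average_log_relative_fitness(prior, rows)
```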

It is worth noting that Haldane [33] defined the intensity of selection on a
population as, in the present notation,

\[
\log \frac{\pi_t(\hat{\theta})}{\pi_0(\hat{\theta})}
\]

where $\hat{\theta}$ is the "optimal" (i.e., most selected-for) value of $\theta$. For us, this
intensity of selection is just $\log\left[R_t(\hat{\theta})/\Pi_0(R_t)\right]$, where $\hat{\theta}$ is the (or a) MLE.
Appendix B: Verification of Assumptions 5–7 for the Example

Since the $X_1^\infty$ process is a function of the underlying process, and the latter is an
aperiodic Markov chain, both are $\psi$-mixing (see [44] [60] for the definition of
$\psi$-mixing and demonstrations that aperiodic Markov chains and their functions
are $\psi$-mixing). Let $\hat{P}_t^{(k)}$ be the empirical distribution of sequences of length $k$
obtained from $x_1^t$. For a Markov chain of order $k$, the likelihood is a function of
$\hat{P}_t^{(k+1)}$ alone; we will use this and the ergodic properties of the data-generating
process to construct sets on which the time-averaged log-likelihood converges
uniformly. Doing this will involve constraining both the order of the Markov
chains and their transition probabilities, and gradually relaxing the constraints.

It will simplify notation if from here on all logarithms are taken to base 2. 

Pick $\epsilon > 0$ and let $k(t)$ be an increasing positive-integer-valued function of $t$,
$k(t) \to \infty$, subject to the limit $k(t) \le \frac{\log t}{h_P + \epsilon}$, where $h_P$ is the Shannon entropy
rate of $P$, which direct calculation shows is 2/3. The $\psi$-mixing property of $X_1^\infty$
implies [50, p. 179] that

\[
P\left(\rho_{TV}\left(\hat{P}_t^{(k(t))}, P^{(k(t))}\right) > \delta\right) \le \frac{\log t}{h_P + \epsilon} \, 2 (n+1)^{t^{\gamma_1}} 2^{-n C_1 \delta^2} \tag{38}
\]

where $\rho_{TV}$ is total variation distance, $P^{(k(t))}$ is $P$'s restriction to sequences of
length $k(t)$, $n = \lfloor t/k(t) \rfloor - 1$, $\gamma_1 = (h_P + \epsilon/2)/(h_P + \epsilon)$ and $C_1$ is a positive
constant specific to $P$ (the exact value of which is not important).
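
To get a feel for how slowly the admissible block length grows, here is a small sketch. It assumes the order bound has the form $k(t) \le \log_2 t/(h_P + \epsilon)$, which is an assumption of this sketch chosen to match the definition of $\gamma_1$; the function names and the value `eps = 0.1` are hypothetical.

```python
import math

def block_length(t, h_p, eps):
    """Largest positive integer k with k <= log2(t) / (h_p + eps), or 1 if there is none."""
    return max(1, math.floor(math.log2(t) / (h_p + eps)))

def schedule(t, h_p=2 / 3, eps=0.1):
    """Return (k, n, gamma1): block length, n = floor(t / k) - 1, and the exponent gamma1."""
    k = block_length(t, h_p, eps)
    n = t // k - 1
    gamma1 = (h_p + eps / 2) / (h_p + eps)
    return k, n, gamma1

# Block length, block count and exponent at a few sample sizes.
results = {t: schedule(t) for t in (10**2, 10**4, 10**6)}
for t, (k, n, gamma1) in results.items():
    print(t, k, n, round(gamma1, 4))
```

Under this assumed bound, $2^{k(h_P + \epsilon/2)} \le t^{\gamma_1}$, which is where the exponent $t^{\gamma_1}$ in the deviation bound would come from.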
The log-likelihood per observation of a Markov chain $\theta \in \Theta_k$ is

\[
t^{-1} \log f_\theta(x_1^t) = t^{-1} \log f_\theta(x_1^k) + \sum_{w, a} \hat{P}_t^{(k+1)}(wa) \log f_\theta(a|w)
\]

where $f_\theta(a|w)$ is of course the probability, according to $\theta$, of producing $a$ after
seeing $w$. By asymptotic equipartition, this is converging a.s. to its expected
value, $-h_P - h(\theta)$.
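
The claim that the likelihood depends on the data only through the empirical distribution of blocks of length $k+1$ can be illustrated as follows. This is a sketch, not the paper's code: the transition table `trans`, the sequence `x`, and the function names are hypothetical, the initial-block term is left out (so the quantity is the log-likelihood conditional on the first $k$ symbols), and logarithms are taken to base 2.

```python
import math
from collections import Counter

def empirical_block_distribution(x, k):
    """Empirical distribution of the overlapping length-k blocks of the sequence x."""
    counts = Counter(tuple(x[i:i + k]) for i in range(len(x) - k + 1))
    total = sum(counts.values())
    return {block: c / total for block, c in counts.items()}

def log_likelihood_direct(x, k, trans):
    """Sum of log2 transition probabilities along x, conditioning on the first k symbols."""
    return sum(math.log2(trans[(tuple(x[i - k:i]), x[i])]) for i in range(k, len(x)))

def log_likelihood_from_blocks(x, k, trans):
    """The same quantity computed only from the empirical (k+1)-block distribution."""
    blocks = empirical_block_distribution(x, k + 1)
    n_blocks = len(x) - k
    return n_blocks * sum(
        prob * math.log2(trans[(block[:-1], block[-1])]) for block, prob in blocks.items()
    )

# Hypothetical first-order binary chain: keys are (context, next symbol).
trans = {((0,), 0): 0.25, ((0,), 1): 0.75, ((1,), 0): 0.5, ((1,), 1): 0.5}
x = [0, 1, 1, 0, 1, 1, 1, 0, 0, 1]
direct = log_likelihood_direct(x, 1, trans)
via_blocks = log_likelihood_from_blocks(x, 1, trans)
```

The two computations agree up to rounding, so any two sequences with the same empirical $(k+1)$-block distribution and length receive the same conditional likelihood.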

Let $z(\theta) = \max_{w,a} |\log f_\theta(a|w)|$. If $z(\theta) \le z_0$ and
$\rho_{TV}(\hat{P}_t^{(k+1)}, P^{(k+1)}) \le \delta$, then $t^{-1} \log f_\theta(x_1^t)$ is within $z_0 \delta$ of
$-h_P - h(\theta)$. Meanwhile, $t^{-1} \log p(x_1^t)$ is converging a.s. to $-h_P$, and again [50]

\[
P\left(\left|t^{-1} \log p(X_1^t) + h_P\right| > \delta\right) \le q(t, \delta) 2^{-t C_2 \delta} \tag{39}
\]

for some $C_2 > 0$ and sub-exponential $q(t, \delta)$. (The details are unilluminating in
the present context and thus skipped.)

Define $G(t, z_0)$ as the set of all Markov models whose order is less than or
equal to $k(t) - 1$ and whose log transition probabilities do not exceed $z_0$ in
magnitude, in symbols

\[
G(t, z_0) = \{\theta : z(\theta) \le z_0\} \cap \left( \bigcup_{j=1}^{k(t)-1} \Theta_j \right)
\]
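Combining the deviation-probability bounds 38 and 39, for all $\theta \in G(t, z_0)$,

\[
P\left(\left|\frac{\log R_t(\theta)}{t} + h(\theta)\right| > \delta\right) \le \frac{\log t}{h_P + \epsilon} \, 2 (n+1)^{t^{\gamma_1}} 2^{-\frac{n C_1 \delta^2}{4 z_0^2}} + q(t, \delta) 2^{-\frac{t C_2 \delta}{2}} \tag{40}
\]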



These probabilities are clearly summable as $t \to \infty$, so by the Borel-Cantelli
lemma, we have uniform almost-sure convergence of $t^{-1} \log R_t(\theta)$ to $-h(\theta)$ for
all $\theta \in G(t, z_0)$.

The sets $G(t, z_0)$ eventually expand to include Markov models of arbitrarily
high order, but maintain a constant bound on the transition probabilities. To
relax this, let $z_t$ be an increasing function of $t$, $z_t \to \infty$, subject to
$z_t \le C_3 t^{\gamma_2}$ for positive $\gamma_2 < \gamma_1$. Then the deviation probabilities remain
summable, and for each $t$, the convergence of $t^{-1} \log R_t(\theta)$ is still uniform on
$G(t, z_t)$. Set $G_t = G(t, z_t)$, and turn to verifying the remaining assumptions.

Start with Assumption 5, taking its items in reverse order. So far, the only
restriction on the prior $\Pi_0$ has been that its support should be the whole of
$\Theta$, and that it should have the "Kullback-Leibler rate property", giving positive
weight to every set $N_\epsilon = \{\theta : d(\theta) \le \epsilon\}$. This, together with the fact that
$\lim_t G_t = \Theta$, means that $h(G_t) \to h(\Theta)$, which is item (3) of the assumption.
The same argument also delivers Assumption 7. Item (2), uniform convergence
on each $G_t$, is true by construction. Finally (for this assumption), since
$h(\Theta) = 0$, any $\beta > 0$ will do, and there are certainly probability measures where
$\Pi_0(G_t^c) \le \alpha \exp\{-\beta t\}$ for some $\alpha, \beta > 0$. So, Assumption 5 is satisfied.

Only Assumption 6 remains. Since Assumptions 1–3 have already been checked,
we can apply Eq. 18 from the proof of Lemma 6 and see that, for each fixed $G$
from the sequence of $G_t$, for any $\epsilon > 0$, for all sufficiently large $t$,

\[
t^{-1} \log \Pi_0(G R_t) \le -h(G) + \epsilon + t^{-1} \log \Pi_0(G) \quad \text{a.s.}
\]

This shows that $\tau(G_t, \delta)$ is almost surely finite for all $t$ and $\delta$, but still leaves
open the question of whether for every $\delta$ and all sufficiently large $t$, $t \ge \tau(G_t, \delta)$
(a.s.). Reformulating a little, the desideratum is that for each $\delta$, with probability
1, $t < \tau(G_t, \delta)$ only finitely often. By the Borel-Cantelli lemma, this will happen
if $\sum_t P(\tau(G_t, \delta) > t) < \infty$. However, if $\tau(G_t, \delta) > t$, it must be equal to some
particular $n > t$, so there is a union bound:

\[
\sum_t P(\tau(G_t, \delta) > t) \le \sum_t \sum_{n=t+1}^{\infty} P\left(\frac{\log \Pi_0(G_t R_n)}{n} > \delta - h(G_t)\right) \tag{41}
\]



From the proof of Lemma 6 (specifically from Eqs. 15, 16 and 17), we can see that
by making $t$ large enough, the only way to have the event
$n^{-1} \log \Pi_0(G_t R_n) > \delta - h(G_t)$ is to have $|n^{-1} \log R_n(\theta) + h(\theta)| > \delta/2$
everywhere on a positive-measure subset of $G_t$. But we know from Eq. 40 not only
that the inner sum can be made arbitrarily small by taking $t$ sufficiently large,
but that the whole double sum is finite. So $\tau(G_t, \delta) > t$ only finitely often (a.s.).